changeset 0:6231ae8f87b8
Uploaded
author | wolma |
---|---|
date | Wed, 11 Feb 2015 08:29:02 -0500 |
parents | |
children | a548b3c6ed00 |
files | annotate_variants.xml bamsort.xml cloudmap.xml convert.xml covstats.xml deletion_predictor.xml fileinfo.xml reheader.xml sam_header.xml snap_caller.xml snp_caller_caller.xml snpeff_genomes.xml tool_dependencies.xml varextract.xml vcf_filter.xml |
diffstat | 15 files changed, 1463 insertions(+), 0 deletions(-) |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/annotate_variants.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,166 @@ +<tool id="annotate_variants" name="Variant Annotation"> + <description>Predict the effects of SNPs and indels on known genes in the reference genome using SnpEff</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd annotate + + "$inputfile" + + #if $str($annotool.name)=='snpeff': + --genome "${annotool.genomeVersion}" + #if $annotool.ori_output: + --snpeff-out "$snpeff_file" + #end if + #if $annotool.stats: + --stats "$summary_file" + #end if + ${annotool.snpeff_settings.chr} ${annotool.snpeff_settings.no_us} ${annotool.snpeff_settings.no_ds} ${annotool.snpeff_settings.no_intron} ${annotool.snpeff_settings.no_intergenic} ${annotool.snpeff_settings.no_utr} + #if $annotool.snpeff_settings.min_cov: + --minC "${annotool.snpeff_settings.min_cov}" + #end if + #if $annotool.snpeff_settings.min_qual: + --minQ "${annotool.snpeff_settings.min_qual}" + #end if + #if $annotool.snpeff_settings.ud: + --ud "${annotool.snpeff_settings.ud}" + #end if + #end if + + --ofile "$outputfile" + #if $str($formatting.oformat) == "text": + --oformat text + #end if + #if $str($formatting.oformat) == "html": + #if $formatting.formatter_file: + --link "${formatting.formatter_file}" + #end if + #if $formatting.species + --species "${formatting.species}" + #end if + #end if + + #if $str($grouping): + --grouping $grouping + #end if + --verbose + </command> + + <inputs> + <param name="inputfile" type="data" format="vcf" label="vcf inputfile to be annotated" /> + <param name="grouping" type="select" label="Group variants by"> + <option value="">order in the input file</option> + <option value="by_sample">sample</option> + <option value="by_genes">most affected genes</option> + </param> + <conditional name="formatting"> + <param name="oformat" type="select" label="Format of the annotation output file"> + <option value="html">HTML</option> + 
<option value="text">Tab-separated plain text</option> + </param> + <when value="html"> + <param name="formatter_file" type="data" format="txt" optional="true" label="Optional file with hyperlink formatting instructions" /> + <param name="species" type="text" label="Species" help="Overwrite the species guess from the SnpEff genome, often not necessary" /> + </when> + </conditional> + <conditional name="annotool"> + <param name="name" type="select" label="Use this tool to annotate the input file" help = "Select SnpEff here, if you want to have the vcf input annotated with genomic feature information. Select None if you do not want additional annotation, if you do not have SnpEff installed, or if you have no appropriate SnpEff annotation file for the input."> + <option value="snpeff">SnpEff</option> + <option value="None">None</option> + </param> + <when value="snpeff"> + <param name="genome_list" type="data" format="tabular" label="genome list" /> + <param name="genomeVersion" type="select" label="Genome"> + <options from_dataset="genome_list"> + <column name="name" index="0"/> + <column name="value" index="1"/> + </options> + </param> + <param name="ori_output" type="boolean" checked="true" label="Keep the original SnpEff output" /> + <param name="stats" type="boolean" checked="true" label="Produce a summary file of results" /> + + <conditional name="snpeff_settings"> + <param name="detail_level" type="select" label="SnpEff-specific parameter settings" help="This section lets you specify the detailed parameter settings for the SnpEff tool."> + <option value="default">default settings</option> + <option value="change">change settings</option> + </param> + <when value="default"> + ## default settings for SnpEff + <param name="chr" type="hidden" value=""/> + <param name="min_cov" type="hidden" value=""/> + <param name="min_qual" type="hidden" value=""/> + <param name="no_ds" type="hidden" value=""/> + <param name="no_us" type="hidden" value=""/> + <param 
name="no_intron" type="hidden" value=""/> + <param name="no_intergenic" type="hidden" value=""/> + <param name="no_utr" type="hidden" value=""/> + <param name="ud" type="hidden" value=""/> + </when> + <when value="change"> + <param name="chr" type="boolean" truevalue="-chr" falsevalue="" checked="false" label="prepend 'chr' to chromosome names, e.g., 'chr7' instead of '7'" /> + <param name="min_cov" type="integer" optional="true" label="minimum coverage (default = not used)" help="do not include variants with a coverage lower than this value"/> + <param name="min_qual" type="integer" optional="true" label="minimum quality (default = not used)" help="do not include variants with a quality lower than this value"/> + <param name="no_ds" type="boolean" label="do not show downstream changes" truevalue="--no-downstream" falsevalue="" checked="false" help="annotation of effects on the downstream region of genes can be suppressed"/> + <param name="no_us" type="boolean" label="do not show upstream changes" truevalue="--no-upstream" falsevalue="" checked="false" help="annotation of effects on the upstream region of genes can be suppressed"/> + <param name="no_intron" type="boolean" label="do not show intron changes" truevalue="--no-intron" falsevalue="" checked="false" help="annotation of effects on introns of genes can be suppressed"/> + <param name="no_intergenic" type="boolean" label="do not show intergenic changes" truevalue="--no-intergenic" falsevalue="" checked="false" help="annotation of effects on intergenic regions can be suppressed"/> + <param name="no_utr" type="boolean" label="do not show UTR changes" truevalue="--no-utr" falsevalue="" checked="false" help="annotation of effects on the untranslated regions of genes can be suppressed"/> + <param name="ud" type="integer" optional="true" label="upstream downstream interval length (default = 5000 bases)" help="specify the upstream/downstream interval length, i.e., variants more than INTERVAL nts from the next 
annotated gene are considered to be intergenic"/> + </when> + </conditional> + </when> + </conditional> + </inputs> + + <outputs> + <data name="outputfile" format="html" > + <change_format> + <when input="formatting.oformat" value="text" format="tabular"/> + </change_format> + </data> + <data name="snpeff_file" format="vcf" > + <filter>(annotool['name']=="snpeff" and annotool['ori_output'])</filter> + </data> + <data name="summary_file" format="html"> + <filter>(annotool['name']=="snpeff" and annotool['stats'])</filter> + </data> + </outputs> + + <help> +.. class:: infomark + + **What it does** + +The tool turns a variant list in VCF format into a more readable summary table listing variant sites and effects. + +If installed, the variant annotation tool SnpEff can be used transparently to determine the genomic features, e.g., genes or transcripts, affected by the variants. + +Use of this feature requires that you have an appropriate SnpEff genome file installed on the host machine. You can use the *List installed SnpEff genomes* tool to generate a list of all available SnpEff genomes. +This list can then be used (by selecting the dataset as the *genome list*) to populate the *genome* dropdown menu, from which you can select the SnpEff genome file to be used for the annotation. + +As output file formats HTML or plain text are supported. +In HTML mode, variant positions and/or affected genomic features can be turned into hyperlinks to corresponding views in web-based genome browsers and databases. + +The behavior of this feature depends on: + +1) Recognition of the species that is analyzed + + You can declare the species you are working with using the *Species* text field. + If you are not declaring the species explicitly, but are choosing SnpEff for effect annotation, the tool will usually be able to auto-detect the species from the SnpEff genome you are using. 
+ If no species gets assigned in either way, no hyperlinks will be generated and the html output will look essentially like plain text. + +2) Available hyperlink formatting rules for this species + + When the species has been recognized, the tool checks if you have selected an *optional file with hyperlink formatting instructions*. + If you did and that file contains an entry matching the recognized species, that entry will be used as a template to construct the hyperlinks. + If no matching entry is found in the file, an error will be raised. + + If you did not supply a hyperlink formatting instruction file, the tool will consult an internal lookup table to see if it finds default rules for the construction of the hyperlinks for the species. + If not, no hyperlinks will be generated and the html output will look essentially like plain text. + + **TIP:** + MiModD's internal hyperlink formatting lookup tables are maintained and growing with every new version, but since weblinks are changing frequently as well, it is possible that you will encounter broken hyperlinks for your species of interest. In such a case, you can resort to two things: `tell us about the problem`_ to make sure it gets fixed in the next release and, in the meantime, use a custom file with hyperlink formatting instructions to overwrite the default entry for your species. + +.. _tell us about the problem: mailto:mimodd@googlegroups.com + </help> +</tool> +
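The help text of annotate_variants.xml describes a lookup order for hyperlink formatting rules: species first, then an optional user file, then an internal table. The following is a minimal sketch of that decision order only; the function name, the dictionary-based rule tables and the example URL are hypothetical and are not taken from MiModD's code.

```python
def resolve_link_template(species, custom_rules=None, builtin_rules=None):
    """Return a hyperlink template for `species`, or None for plain output.

    Hypothetical illustration of the order described in the help text:
    1. no recognized species -> no hyperlinks
    2. custom rules file given -> must contain the species, else error
    3. otherwise fall back to built-in defaults, if any
    """
    if not species:
        return None
    if custom_rules is not None:
        if species not in custom_rules:
            raise KeyError("no entry for species %r in custom rules" % species)
        return custom_rules[species]
    return (builtin_rules or {}).get(species)


# Assumed example data, for illustration only.
builtin = {"C. elegans": "http://example.org/browse?pos={pos}"}
print(resolve_link_template("C. elegans", builtin_rules=builtin))
print(resolve_link_template("", builtin_rules=builtin))
```

Note that a supplied custom file takes precedence over the built-in table even when it lacks the species, which is why that branch raises instead of falling through.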
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/bamsort.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,35 @@ +<tool id="bamsort" name="Sort BAM file"> + <description>Sort a BAM file by coordinates (or names) of the mapped reads</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd sort "$inputfile" -o "$output" --oformat $oformat $by_name + </command> + + <inputs> + <param name="inputfile" type="data" format="bam" label="Input file to sort" /> + <param name="by_name" type="boolean" truevalue = "-n" falsevalue ="" label="Sort by read names instead of coordinates" checked = "false" help="A less common option, but necessary, e.g., if you want to re-align sorted output from a previous run of the Snap Align Tool." /> + <param name="oformat" type="boolean" truevalue = "sam" falsevalue = "bam" label = "Output in uncompressed SAM format" checked = "false" /> + </inputs> + + <outputs> + <data name="output" format="bam" label="Sorted output from MiModd ${tool.name} on ${on_string}"> + <change_format> + <when input="oformat" value="sam" format="sam" /> + </change_format> + </data> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool sorts a BAM file of aligned reads, typically by the reference genome coordinates that the reads have been mapped to. + +Coordinate-sorted input files are expected by most downstream MiModD tools, but note that the *SNAP Read Alignment* produces coordinate-sorted output by default and it is only necessary to sort files that come from other sources or from *SNAP Read Alignment* jobs with a custom sort order. + +The option *Sort by read names instead of coordinates* is useful if you want to re-align coordinate-sorted paired-end data. 
In *paired-end mode*, the *SNAP Read Alignment* tool expects the reads in the input file to be arranged in read pairs, i.e., the forward read information of a pair must be followed immediately by its reverse mate information, which is typically not the case in coordinate-sorted files. Resorting such files by read names fixes this problem. + +</help> +</tool> +
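To make the effect of the two boolean parameters in bamsort.xml concrete, here is a small sketch that assembles the same argument list the `<command>` template would render. The helper name and the file names are hypothetical; the flags (`-o`, `--oformat`, `-n`) are the ones that appear in the template above.

```python
def build_sort_command(inputfile, output, by_name=False, sam_output=False):
    """Mirror the <command> template of bamsort.xml as an argument list."""
    argv = [
        "mimodd", "sort", inputfile,
        "-o", output,
        # the oformat boolean maps true -> "sam", false -> "bam"
        "--oformat", "sam" if sam_output else "bam",
    ]
    if by_name:
        # the by_name boolean renders as "-n" when checked, nothing otherwise
        argv.append("-n")
    return argv


# Name-sorted BAM output, e.g. before re-aligning paired-end reads.
print(" ".join(build_sort_command("reads.bam", "sorted.bam", by_name=True)))
```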
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/cloudmap.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,70 @@ +<tool id="cloudmap_prepare" name="Prepare variant data for mapping"> + <description>with the CloudMap series of tools.</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd cloudmap "$ifile" ${run.mode} + + #if $str($run.mode) != "SVD": + "${run.refsample}" + #end if + + "$sample" -o "$ofile" + + #if $seqdict: + -s "$dictfile" + #end if + </command> + + <inputs> + <param name="ifile" type="data" format="vcf" label="vcf input file" /> + <conditional name="run"> + <param name="mode" type="select" label="CloudMap analysis to prepare data for"> + <option value="SVD">EMS Variant Density Mapping</option> + <option value="VAF">Variant Discovery / Hawaiian Variant Mapping</option> + </param> + <when value="SVD"> + <param name="refsample" type="hidden" value="None" /> + </when> + <when value="VAF"> + <param name="refsample" type="text" label="name of the reference sample" help="the sample that provides mapping strain variants" /> + </when> + </conditional> + <param name="sample" type="text" label="subject sample name" help="the sample to perform CloudMap mapping for" /> + <param name="seqdict" type="boolean" checked="true" label="Generate species configuration file for CloudMap" /> + + </inputs> + + <outputs> + <data name="ofile" format="vcf" label="CloudMap-ready ${run.mode} File from ${on_string}" /> + <data name="dictfile" format="tabular" label="Species Configuration File for CloudMap from ${on_string}"> + <filter>seqdict</filter> + </data> + </outputs> + + <help> +.. class:: infomark + + **What it does** + +The purpose of this tool is to provide compatibility of the MiModD analysis workflow with the external `CloudMap`_ *EMS Variant Density Mapping*, *Variant Discovery Mapping* and *Hawaiian Variant Mapping* tools. 
+ +These tools complement MiModD by providing easily interpreted visualizations of mapping-by-sequencing analysis workflows. + +The tool converts a VCF file as generated by the *Extract Variant Sites* or *VCF Filter* tools to the format expected by the *CloudMap* series of tools. + +Optionally, it also extracts the chromosome names and sizes and reports them in the *CloudMap* *species configuration file* format. +Such a file is required as input to the current versions of the *CloudMap* *Hawaiian* and *Variant Density* mapping tools, if you are working with a species other than the natively supported ones (i.e., other than *C. elegans* or *A. thaliana*). + +To use the output datasets of the tool with *CloudMap*, you only have to upload them to any public Galaxy server that hosts *CloudMap* like, e.g., the main Galaxy server at https://usegalaxy.org . + +.. class:: warningmark + + EMS Variant Density Mapping is currently limited to *C. elegans* and other species with six chromosomes on the *CloudMap* side. + +More information on combining MiModD and CloudMap in mapping-by-sequencing analyses can be found in the `corresponding section of the MiModD User Guide`_. + +.. _CloudMap: https://usegalaxy.org/u/gm2123/p/cloudmap +.. _corresponding section of the MiModD User Guide: http://mimodd.readthedocs.org/en/latest/cloudmap.html + + </help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/convert.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,133 @@ +<tool id="convert" name="Convert"> + <description>between different sequence data formats</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd convert + + #for $i in $mode.input_list + "${i.file1}" + #if $str($mode.iformat) in ("fastq_pe", "gz_pe"): + "${i.file2}" + #end if + #end for + #if $str($mode.header) != "None": + --header "$(mode.header)" + #end if + --ofile "$outputname" + --iformat $(mode.iformat) + --oformat $(mode.oformat) + </command> + + <inputs> + <conditional name="mode"> + <param name="iformat" type="select" label="input file format" help="Your choice will update the interface to display further choices appropriate for your type of input data."> + <option value="fastq">fastq: single-end (one file)</option> + <option value="fastq_pe">fastq: paired-end (two files)</option> + <option value="gz">gzip compressed fastq: single-end (one file)</option> + <option value="gz_pe">gzip compressed fastq: paired-end (two files)</option> + <option value="sam">sam</option> + <option value="bam">bam</option> + </param> + <when value="fastq"> + <param name="oformat" type="select" label="output file format"> + <option value="sam">sam</option> + <option value="bam">bam</option> + </param> + <repeat name="input_list" title="fastq input dataset" default="1" min="1"> + <param name="file1" format="fastq" type="data" label="inputfile"/> + </repeat> + <param name="header" type="data" format="sam" label="Use Header File" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file."/> + </when> + <when value="fastq_pe"> + <param name="oformat" type="select" label="output file format"> + <option value="sam">sam</option> + <option value="bam">bam</option> + </param> + <repeat name="input_list" title="fastq input datasets" 
default="1" min="1"> + <param format="fastq" name="file1" type="data" label="inputfile with the first set of reads of paired-end data"/> + <param format="fastq" name="file2" type="data" label="inputfile with the second set of reads of paired-end data"/> + </repeat> + <param name="header" type="data" format="sam" label="Use Header File" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file."/> + </when> + <when value="gz"> + <param name="oformat" type="select" label="output file format"> + <option value="sam">sam</option> + <option value="bam">bam</option> + </param> + <repeat name="input_list" title="fastq.gz input dataset" default="1" min="1"> + <param name="file1" format="data" type="data" label="inputfile"/> + </repeat> + <param name="header" type="data" format="sam" label="Use Header File" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file."/> + </when> + <when value="gz_pe"> + <param name="oformat" type="select" label="output file format"> + <option value="sam">sam</option> + <option value="bam">bam</option> + </param> + <repeat name="input_list" title="fastq.gz input datasets" default="1" min="1"> + <param format="data" name="file1" type="data" label="inputfile with the first set of reads of paired-end data"/> + <param format="data" name="file2" type="data" label="inputfile with the second set of reads of paired-end data"/> + </repeat> + <param name="header" type="data" format="sam" label="Use Header File" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file."/> + </when> + <when value="sam"> + <param name="oformat" type="select" label="output file format"> + <option value="bam">bam</option> + </param> + <repeat 
name="input_list" title="sam input dataset" default="1" min="1" max="1"> + <param name="file1" format="sam" type="data" label="inputfile"/> + </repeat> + <param name="header" type="hidden" value="None"/> + </when> + <when value="bam"> + <param name="oformat" type="select" label="output file format"> + <option value="sam">sam</option> + </param> + <repeat name="input_list" title="bam input dataset" default="1" min="1" max="1"> + <param name="file1" format="bam" type="data" label="inputfile"/> + </repeat> + <param name="header" type="hidden" value="None"/> + </when> + </conditional> + </inputs> + + <outputs> + <data name="outputname" format="bam" label="Converted reads from MiModd ${tool.name} on ${on_string}"> + <change_format> + <when input="mode.oformat" value="sam" format="sam" /> + </change_format> + </data> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool converts between different file formats used for storing next-generation sequencing data. + +As input file types it can handle uncompressed or gzipped fastq, SAM or BAM format, which it can convert to SAM or BAM format. + +**Notes:** + +1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to align gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. + +2) The tool can convert fastq files representing data from paired-end sequencing runs to appropriate SAM/BAM format provided that the mate information is split over two fastq files in corresponding order. + + **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. 
+ +3) Merging partial fastq (or gzipped fastq) files into a single SAM/BAM file is supported both for single-end and paired-end data. Simply add additional input datasets and select the appropriate files (pairs of files in case of paired-end data). + + Concatenation of SAM/BAM file during conversion is currently not supported. + +4) For input in fastq format a SAM header file providing run metadata **has to be specified**. The information in this file will be used as the header data of the new SAM/BAM file. You can use the *NGS Run Annotation* tool to generate a new header file for your data. + + For input in SAM/BAM format the tool will simply copy the existing header data to the new file. To modify the header of an existing SAM/BAM file, use the *Reheader BAM file* tool instead. + +.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 +.. _recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy +.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest + +</help> +</tool> +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/covstats.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,27 @@ +<tool id="coverage_stats" name="Coverage Statistics"> + <description>Calculate coverage statistics for a BCF file as generated by the Variant Calling tool</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd covstats "$ifile" --ofile "$output_vcf" + </command> + + <inputs> + <param name="ifile" type="data" format="bcf" label="BCF input file" help="Use the Variant Calling tool to generate input for this tool."/> + </inputs> + <outputs> + <data name="output_vcf" format="tabular" label="Coverage Statistics for ${on_string}"/> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool takes as input a BCF file produced by the *Variant Calling* tool, and calculates per-chromosome read coverage from it. + +.. class:: warningmark + + The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. + +</help> +</tool>
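The warning in covstats.xml says that positions missing from the BCF input are treated as zero coverage. The sketch below illustrates why that matters; it is not MiModD's implementation, and the function name and the position-to-depth dictionary are assumptions made for the example.

```python
def mean_coverage(depth_by_pos, chrom_length):
    """Mean depth over a chromosome.

    Positions absent from `depth_by_pos` contribute zero, which is the
    behavior the warning describes.
    """
    if chrom_length <= 0:
        raise ValueError("chromosome length must be positive")
    return sum(depth_by_pos.values()) / chrom_length


# All four sites recorded: the mean reflects the real data.
all_sites = {1: 10, 2: 10, 3: 10, 4: 10}
# Only one site kept (e.g. a variants-only file): the mean is underestimated.
variants_only = {2: 10}

print(mean_coverage(all_sites, 4))
print(mean_coverage(variants_only, 4))
```

With the variants-only input, three covered positions silently count as zero, so the reported coverage drops even though the underlying reads are the same.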
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/deletion_predictor.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,61 @@ +<tool id="deletion_prediction" name="Deletion Prediction for paired-end data"> + <description>Predicts deletions in one or more aligned read samples based on coverage of the reference genome and on insert sizes</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd delcall + #for $l in $list_input + "${l.bamfile}" + #end for + "$covfile" -o "$outputfile" + --max-cov "$max_cov" --min-size "$min_size" $include_uncovered $group_by_id --verbose + </command> + + <inputs> + <repeat name="list_input" title="Aligned reads input source" default="1" min="1"> + <param name="bamfile" type="data" format="bam" label="input BAM file" /> + </repeat> + <param name="covfile" type="data" format="bcf" label="BCF variant call file to extract coverage from" help="Use the Variant Calling tool to generate this file."/> + <param name="group_by_id" type="boolean" label="group reads based on read group id only" truevalue="-i" falsevalue="" checked="true" help="If selected, reads from different read groups will be treated strictly separate. If turned off, read groups with identical sample names are used together for identifying uncovered regions, but are still treated separately for the prediction of deletions." /> + <param name="include_uncovered" type="boolean" label="include low-coverage regions" truevalue="-u" falsevalue="" checked="true" help="If selected, regions that fulfill the coverage criteria below, but are not statistically significant deletions, will be included in the output." 
/> + <param name="max_cov" type="integer" value="0" label="maximal coverage allowed inside a low-coverage region (default: 0)" help="The maximal coverage at a site allowed to consider it as part of a low-coverage region" /> + <param name="min_size" type="integer" value="100" label="minimal deletion size (default: 100)" help="A low-coverage region must consist of at least this number of consecutive bases below the maximal coverage to consider it in further analyses."/> + </inputs> + + <outputs> + <data name="outputfile" format="gff" /> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool predicts deletions from paired-end data in a two-step process: + +1) It finds regions of low-coverage, i.e., candidate regions for deletions, by scanning a BCF file produced by the *Variant Calling* tool. + + The *maximal coverage allowed inside a low-coverage region* and the *minimal deletion size* parameters are used at this step to define what is considered a low-coverage region. + + .. class:: warningmark + + The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. + +2) It assesses every low-coverage region statistically for evidence of it being a real deletion. **This step requires paired-end data** since it relies on shifts in the distribution of read pair insert sizes around real deletions. + +By default, the tool only reports Deletions, i.e., the subset of low-coverage regions that pass the statistical test. +If *include low-coverage regions* is selected, regions that failed the test will also be reported. + +With *group reads based on read group id only* selected, as it is by default, grouping of reads into samples is done strictly based on their read group IDs. +With the option deselected, grouping is done based on sample names in the first step of the analysis, i.e. 
the reads of all samples with a shared sample name are used to identify low-coverage regions. +In the second step, however, reads will be regrouped by their read group IDs again, i.e. the statistical assessment for real deletions is always done on a per read group basis. + +**TIP:** +Deselecting *group reads based on read group id only* can be useful, for example, if you have both paired-end and single-end sequencing data for the same sample. + +In this case, the two sets of reads will usually share a common sample name, but differ in their read groups. +With grouping based on sample names, the single-end data can be used together with the paired-end data to identify low-coverage regions, thus increasing overall coverage and reliability of this step. +Still, the assessment of deletions will use only the paired-end data (auto-detecting that the single-end reads do not provide insert size information). + +</help> + +</tool>
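Step 1 of the deletion prediction described above (finding low-coverage candidate regions) can be sketched as a single scan over per-base depths. This is an illustration under stated assumptions, not the code used by `mimodd delcall`: the function name and the list-of-depths input are hypothetical, a site counts as low coverage when its depth is at most `max_cov`, and regions are reported as 0-based half-open intervals.

```python
def low_coverage_regions(coverage, max_cov=0, min_size=100):
    """Return (start, end) runs where depth <= max_cov for >= min_size bases."""
    regions = []
    start = None  # start index of the run currently being extended
    for pos, depth in enumerate(coverage):
        if depth <= max_cov:
            if start is None:
                start = pos
        else:
            if start is not None and pos - start >= min_size:
                regions.append((start, pos))
            start = None
    # close a run that extends to the end of the sequence
    if start is not None and len(coverage) - start >= min_size:
        regions.append((start, len(coverage)))
    return regions


depths = [5] * 10 + [0] * 150 + [7] * 10 + [0] * 50
print(low_coverage_regions(depths))               # defaults from the tool form
print(low_coverage_regions(depths, min_size=50))  # also accepts the shorter run
```

The statistical assessment in step 2, which relies on insert-size shifts in paired-end data, is deliberately left out of this sketch.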
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/fileinfo.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,34 @@ +<tool id="fileinfo" name="Retrieve File Information"> + <description>for supported data formats.</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd info "$ifile" -o "$outputfile" --verbose --oformat $oformat + </command> + + <inputs> + <param name="ifile" type="data" format="bam,sam,vcf,bcf,fasta" label="input file" /> + <param name="oformat" type="select" label="output format"> + <option value="txt">text</option> + <option value="html">html</option> + </param> + </inputs> + + <outputs> + <data name="outputfile" format="txt" label="Sample Info on ${on_string}"> + <change_format> + <when input="oformat" value="html" format="html"/> + </change_format> + </data> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool inspects the input file and generates a report summarizing its contents. + +It autodetects and works with most file formats produced by MiModD, i.e., **SAM / BAM, vcf / bcf and fasta**, and produces a standardized report for all of them. + +</help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/reheader.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,202 @@ +<tool id="reheader" name="Reheader BAM file"> + + <description>From a BAM file generate a new file with the original header (if any) replaced or modified by that found in a second SAM file</description> + <version_command>mimodd version -q</version_command> + <command> + #if ($str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form") or $str($co.treat_co) != "ignore": + mimodd header + #if $str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form": + #for $rginfo in $rg.rginfo.rg + #if $str($rginfo.source_id): + --rg-id "${rginfo.source_id}" + #end if + #if $str($rginfo.rg_sm): + --rg-sm "${rginfo.rg_sm}" + #end if + #if $str($rginfo.rg_cn): + --rg-cn "${rginfo.rg_cn}" + #else: + --rg-cn "" + #end if + #if $str($rginfo.rg_ds): + --rg-ds "${rginfo.rg_ds}" + #else: + --rg-ds "" + #end if + #if $str($rginfo.rg_date): + --rg-dt "${rginfo.rg_date}" + #else: + --rg-dt "" + #end if + #if $str($rginfo.rg_lb): + --rg-lb "${rginfo.rg_lb}" + #else: + --rg-lb "" + #end if + #if $str($rginfo.rg_pl): + --rg-pl "${rginfo.rg_pl}" + #else: + --rg-pl "" + #end if + #if $str($rginfo.rg_pi): + --rg-pi "${rginfo.rg_pi}" + #else: + --rg-pi "" + #end if + #if $str($rginfo.rg_pu): + --rg-pu "${rginfo.rg_pu}" + #else: + --rg-pu "" + #end if + #end for + #end if + #if $str($co.treat_co) != "ignore": + --co + #for $comment in $co.coinfo + #if $str($comment.line): + "${comment.line}" + #end if + #end for + #end if + | + #end if + mimodd reheader "$inputfile" --sq ignore + --rg ${rg.treat_rg} + #if $str($rg.treat_rg) != "ignore": + #if $str($rg.rginfo.source) == "from_file": + "${rg.rginfo.data}" + #else: + - + #end if + #for $rgmapping in $rg.rginfo.rg + #if $str($rgmapping.source_id) and $str($rgmapping.rg_id): + "$str($rgmapping.source_id)" : "$str($rgmapping.rg_id)" + #end if + #end for + #end if + + --co ${co.treat_co} + #if $str($co.treat_co) != 
"ignore": + - + #end if + + #set $restr = "" + #for $rename in $rg_renaming + #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '"') + #end for + #if $restr + --rgm $restr + #end if + + #set $restr = "" + #for $rename in $sq_renaming + #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '"') + #end for + #if $restr + --sqm $restr + #end if + + -o "$output" + </command> + + <macros> + <macro name="getreadgroupinfo"> + <conditional name="rginfo"> + <param name="source" type="select" label="source of new read-group information" help=""> + <option value="from_file">existing SAM file</option> + <option value="from_form">input form</option> + </param> + <when value="from_file"> + <param name="data" type="data" format="sam" label="read-group template file in SAM format" help="use the read group information found in this file" /> + <repeat name="rg" title="custom read-group mapping" default="0" min="0" help="read-group information found in the input file, by default, gets updated / replaced with information from template file read-groups with matching IDs. 
Alternatively, you may specify explicit read-group mappings below."> + <param name="source_id" type="text" label="modify input file information for read-group ID (will create the read-group if it does not exist)" /> + <param name="rg_id" type="text" label="with template file information for read-group ID" /> + </repeat> + </when> + <when value="from_form"> + <repeat name="rg" title="new read-group info" default="1" min="1"> + <param name="source_id" type="text" label="read-group ID (will create the read-group if it does not exist)" help="required field" /> + <param name="rg_id" type="hidden" value="" /> + <param name="rg_sm" type="text" label="sample name" help="required field" /> + <param name="rg_ds" type="text" label="description" /> + <param name="rg_date" type="text" label="date (YY-MM-DD format) the run was produced" /> + <param name="rg_cn" type="text" label="name of sequencing center" /> + <param name="rg_lb" type="text" label="read-group library" /> + <param name="rg_pl" type="text" label="platform/technology used to produce the reads" /> + <param name="rg_pi" type="text" label="predicted median insert size" /> + <param name="rg_pu" type="text" label="platform unit; unique identifier" /> + </repeat> + </when> + </conditional> + </macro> + </macros> + + <inputs> + + <param name="inputfile" type="data" format="bam" label="input file in BAM format" help="the file to reheader." /> + + <conditional name="rg"> + <param name="treat_rg" type="select" label="modify read-group information ?" help="Replace mode will ignore ALL existing read group information in the input file and use ONLY template information, Update mode will COPY existing input file information and UPDATE it with template information; choose No, ... 
to leave read-group information alone."> + <option value="ignore">No, do not change read-groups.</option> + <option value="update">Yes, update existing information</option> + <option value="replace">Yes, replace existing information</option> + </param> + <when value="update"> + <expand macro="getreadgroupinfo" /> + </when> + <when value="replace"> + <expand macro="getreadgroupinfo" /> + </when> + </conditional> + + <conditional name="co"> + <param name="treat_co" type="select" label="modify comments in the input file ?" help=""> + <option value="ignore">No, do not change comments.</option> + <option value="update">Yes, append new comments to existing ones</option> + <option value="replace">Yes, replace all existing comments</option> + </param> + <when value="update"> + <repeat name="coinfo" title="comment line" default="0" min="0"> + <param name="line" type="text" size="80" /> + </repeat> + </when> + <when value="replace"> + <repeat name="coinfo" title="comment line" default="0" min="0"> + <param name="line" type="text" size="80" /> + </repeat> + </when> + </conditional> + + <repeat name="rg_renaming" title="rename read-group" default="0" min="0" help="Warning: changing read-group IDs may increase job runtime substantially."> + <param name="from" type="text" size="30" label="old name" help="as it appears in the current input file header"/> + <param name="to" type="text" size="30" label="new name" /> + </repeat> + + <repeat name="sq_renaming" title="rename sequence" default="0" min="0" help="Warning: changing sequence names may increase job runtime substantially."> + <param name="from" type="text" size="30" label="old name" help="as it appears in the current input file header"/> + <param name="to" type="text" size="30" label="new name" /> + </repeat> + + </inputs> + + <outputs> + <data name="output" format="bam" label="(Re)headered bam file from MiModd ${tool.name} on ${on_string}"> + </data> + </outputs> + +<help> +.. 
class:: infomark + + **What it does** + +The tool generates a copy of the BAM input file with a modified header (i.e., metadata). + +It can update or replace read-group information (i.e., information about the samples in the file), add or replace comment lines, and rename reference sequences declared in the header. + +The tool ensures that the resulting BAM file is valid and can be further processed by other MiModD tools and standard software like samtools. It aborts with an error message if a valid BAM file cannot be generated with the user-specified settings. + +The template information used to modify or replace the input file metadata is provided through forms or, in the case of read-group information, can be taken from an existing SAM file as can be generated, for example, with the *NGS Run Annotation* tool. + +</help> +</tool> +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/sam_header.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,125 @@ +<tool id="ngs_run_annotation" name="NGS Run Annotation"> + <description>Create a SAM format header from run metadata for sample annotation.</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd header + + --rg-id "$rg_id" + --rg-sm "$rg_sm" + + #if $str($rg_cn): + --rg-cn "$rg_cn" + #end if + #if $str($rg_ds): + --rg-ds "$rg_ds" + #end if + #if $str($rg_date): + --rg-dt "$rg_date" + #end if + #if $str($rg_lb): + --rg-lb "$rg_lb" + #end if + #if $str($rg_pl): + --rg-pl "$rg_pl" + #end if + #if $str($rg_pi): + --rg-pi "$rg_pi" + #end if + #if $str($rg_pu): + --rg-pu "$rg_pu" + #end if + + --ofile "$outputfile" + + </command> + + <inputs> + <param name="rg_id" type="text" size="80" label="read-group ID (required)"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value="&quot;" /> + </valid> + <mapping initial="none"> + <add source="&quot;" target="\&quot;"/> + </mapping> + </sanitizer> + </param> + <param name="rg_sm" type="text" size="80" label="sample name (required)"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value="&quot;" /> + </valid> + <mapping initial="none"> + <add source="&quot;" target="\&quot;"/> + </mapping> + </sanitizer> + </param> + <param name="rg_ds" type="text" size="80" label="description"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value="&quot;" /> + </valid> + <mapping initial="none"> + <add source="&quot;" target="\&quot;"/> + </mapping> + </sanitizer> + </param> + <param name="rg_date" type="text" label="date (YYYY-MM-DD) the run was produced" /> + <param name="rg_cn" type="text" size="80" label="name of sequencing center"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value="&quot;" /> + </valid> + <mapping initial="none"> + <add source="&quot;" target="\&quot;"/> + </mapping> + </sanitizer> + </param> + <param name="rg_lb" 
type="text" size="80" label="read-group library"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value="&quot;" /> + </valid> + <mapping initial="none"> + <add source="&quot;" target="\&quot;"/> + </mapping> + </sanitizer> + </param> + <param name="rg_pl" type="text" label="platform/technology used to produce the reads" /> + <param name="rg_pi" type="text" label="predicted median insert size" /> + <param name="rg_pu" type="text" size="80" label="platform unit; unique identifier"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value="&quot;" /> + </valid> + <mapping initial="none"> + <add source="&quot;" target="\&quot;"/> + </mapping> + </sanitizer> + </param> + </inputs> + + <outputs> + <data name="outputfile" format="sam" label="${rg_sm} (${rg_id}) header information from MiModd ${tool.name} on ${on_string}"/> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +This tool takes the user-provided information about a next-generation sequencing run and constructs a valid header in the SAM file format from it. + +The result file can be used by the tools *Convert* and *Reheader* or in the *SNAP Read Alignment* step to add run metadata to sequenced reads files (or to overwrite pre-existing information). + +**Note:** + +**MiModD requires run metadata for every input file at the Alignment step !** + +**Tip:** + +While you can do alignments from fastq file format by providing a custom header file directly to the *SNAP Read Alignment* tool, we **recommend** that you first convert all input files to SAM/BAM format and archive all datasets in that format, with appropriate header information, prior to any downstream analysis. Although a bit more time-consuming, this practice protects against information loss and ensures that the input datasets will remain useful for others in the future. + +</help> +</tool> +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/snap_caller.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,232 @@ +<tool id="read_alignment" name="SNAP Read Alignment"> + <description>Map sequence reads to a reference genome using SNAP</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd snap-batch -s + ## SNAP calls (considering different cases) + + #for $i in $datasets + "snap ${i.mode_choose.mode} '$ref_genome' + #if $str($i.mode_choose.mode) == "paired" and $str($i.mode_choose.input.iformat) in ("fastq", "gz"): +'${i.mode_choose.input.ifile1}' '${i.mode_choose.input.ifile2}' + #else: +'${i.mode_choose.input.ifile}' + #end if +--ofile '$outputfile' --iformat ${i.mode_choose.input.iformat} --oformat $oformat +--idx-seedsize '$set.seedsize' +--idx-slack '$set.slack' --maxseeds '$set.maxseeds' --maxhits '$set.maxhits' --clipping=$set.clipping --maxdist '$set.maxdist' --confdiff '$set.confdiff' --confadapt '$set.confadpt' + #if $i.mode_choose.input.header: +--header '${i.mode_choose.input.header}' + #end if + #if $str($i.mode_choose.mode) == "paired": +--spacing '$set.sp_min' '$set.sp_max' + #end if + #if $str($set.selectivity) != "off": +--selectivity '$set.selectivity' + #end if + #if $str($set.filter_output) != "off": +--filter-output $set.filter_output + #end if + #if $str($set.sort) != "off": +--sort $set.sort + #end if + #if $str($set.mmatch_notation) == "general": +-M + #end if +--max-mate-overlap '$set.max_mate_overlap' +--verbose +" + #end for + </command> + + <inputs> + ## mandatory arguments (and mode-conditionals) + + <param name="ref_genome" type="data" format="fasta" label="reference genome" help="The fasta reference genome that SNAP should align reads against."/> + + <repeat name="datasets" title="datasets" default="1" min="1"> + <conditional name="mode_choose"> + <param name="mode" type="select" label="choose mode" help="Reads obtained from single-end sequencing runs should be aligned in 'single' mode, 
paired-end reads in 'paired' mode. **WARNING**: if the read input file is in SAM/BAM format, the current version of this tool will **not** verify the mode and may produce erroneous alignments with wrong settings!"> + <option value="single">single-end</option> + <option value="paired">paired-end</option> + </param> + + <when value="single"> + <conditional name="input"> + <param name="iformat" type="select" label="input file format"> + <option value="bam">BAM</option> + <option value="sam">SAM</option> + <option value="gz">gz</option> + <option value="fastq">fastq</option> + </param> + <when value="bam"> + <param name="ifile" type="data" format="bam" label="input file"/> + <param name="header" type="data" optional="true" format="sam" label="custom header file" /> + </when> + <when value="sam"> + <param name="ifile" type="data" format="sam" label="input file"/> + <param name="header" type="data" optional="true" format="sam" label="custom header file" /> + </when> + <when value="gz"> + <param name="ifile" type="data" label="input file"/> + <param name="header" type="data" format="sam" label="header file" /> + </when> + <when value="fastq"> + <param name="ifile" type="data" format="fastq" label="input file"/> + <param name="header" type="data" format="sam" label="header file" /> + </when> + </conditional> + </when> + <when value="paired"> + <conditional name="input"> + <param name="iformat" type="select" label="input file format"> + <option value="bam">BAM</option> + <option value="sam">SAM</option> + <option value="gz">gz</option> + <option value="fastq">fastq</option> + </param> + <when value="bam"> + <param name="ifile" type="data" format="bam" label="input file"/> + <param name="header" type="data" optional="true" format="sam" label="custom header file" /> + </when> + <when value="sam"> + <param name="ifile" type="data" format="sam" label="input file"/> + <param name="header" type="data" optional="true" format="sam" label="custom header file" /> + </when> + <when 
value="fastq"> + <param name="ifile1" type="data" format="fastq" label="inputfile with the first set of reads of paired-end data"/> + <param name="ifile2" type="data" format="fastq" label="inputfile with the second set of reads of paired-end data"/> + <param name="header" type="data" format="sam" label="header file" help="required" /> + </when> + <when value="gz"> + <param name="ifile1" type="data" label="inputfile with the first set of reads of paired-end data"/> + <param name="ifile2" type="data" label="inputfile with the second set of reads of paired-end data"/> + <param name="header" type="data" format="sam" label="header file" help="required" /> + </when> + </conditional> + </when> + </conditional> + </repeat> + + <param name="oformat" type="select" label="output file format"> + <option value="bam">BAM</option> + <option value="sam">SAM</option> + </param> + + ## optional arguments + + <conditional name="set"> + <param name="settings_mode" type="select" label="further parameter settings" help="This section lets you specify the detailed parameter settings for the SNAP aligner. 
Only change them if you know what you are doing, i.e., read the documentation first."> + <option value="default">default settings</option> + <option value="change">change settings</option> + </param> + + ## default settings + + <when value="default"> + <param name="seedsize" type="hidden" value="20"/> + <param name="slack" type="hidden" value="0.3"/> + <param name="sp_min" type="hidden" value="100"/> + <param name="sp_max" type="hidden" value="10000"/> + <param name="maxdist" type="hidden" value="8"/> + <param name="confdiff" type="hidden" value="2"/> + <param name="confadpt" type="hidden" value="7"/> + + <param name="maxseeds" type="hidden" value="25"/> + <param name="maxhits" type="hidden" value="250"/> + <param name="clipping" type="hidden" value="++"/> + + <param name="selectivity" type="hidden" value="off"/> + <param name="filter_output" type="hidden" value="off"/> + <param name="sort" type="hidden" value="0"/> + <param name="mmatch_notation" type="hidden" value="general"/> + <param name="max_mate_overlap" type="hidden" value="0" /> + </when> + + ## change settings + + <when value="change"> + <param name="seedsize" type="integer" value="20" label="seed size (default: 20)" help="Length of the seeds used in the reference genome hash table (SNAP index option -s)."/> + <param name="slack" type="float" value="0.3" label="hash table slack size (default: 0.3)" help="Corresponds to the -h option of SNAP index."/> + + ## paired-end specific options + <param name="sp_min" type="integer" value="100" label="minimum spacing to allow between paired ends (default: 100)" help="Corresponds to the first value of the SNAP option -s."/> + <param name="sp_max" type="integer" value="10000" label="maximum spacing to allow between paired ends (default: 10000)" help="Corresponds to the second value of the SNAP option -s."/> + <param name="max_mate_overlap" type="float" value="0" label="Maximal overlap between the reads in a pair (as a fraction of their combined length; default: 0, no 
overlap allowed)" help="If the reads of a read pair overlap by more than this fraction of their combined length, they are filtered out" /> + + <param name="maxdist" type="integer" value="8" label="edit distance (default: 8)" help="maximum edit distance allowed per read or pair (SNAP option -d); higher values allow more divergent alignments to be found, but increase the rate of misalignments."/> + <param name="maxhits" type="integer" value="250" label="maximum hits per seed (default: 250)" help="Maximum hits to consider per seed (SNAP option -h); don't use a seed region in the alignment process if it matches more than maxhits regions in the reference genome. Higher values reduce the rate of misalignments, but reduce performance."/> + <param name="confdiff" type="integer" value="2" label="confidence threshold (default: 2)" help="Confidence threshold (SNAP option -c); the minimum edit distance difference between two alternate alignments required to reject the poorer alignment as suboptimal; higher values increase the rate of ambiguously aligned reads."/> + <param name="confadpt" type="integer" value="7" label="adaptive confdiff behaviour (default: 7)" help="Specifies how many seeds of a read may be ignored (based on the maximum hits value above) before the confidence threshold above gets increased by one for that read; helps fine-tuning alignment accuracy in repetitive regions of the genome."/> + <param name="maxseeds" type="integer" value="25" label="maximum seeds per read (default: 25)" help="Number of seeds to use per read (SNAP option -n) when trying to match it to the reference genome; higher numbers will increase the rate of aligned reads and reduce the rate of misalignments, but will reduce performance."/> + <param name="clipping" type="select" label="read clipping (default: from back and front)" help="Specifies from which end of a read low-quality bases should be clipped (SNAP option -Cxx)"> + <option value="++">from back and front</option> + <option 
value="-+">from back only</option> + <option value="+-">from front only</option> + <option value="--">no clipping</option> + </param> + <param name="selectivity" type="integer" value="1" label="selectivity (default: 1)" help="randomly choose 1/selectivity of the reads to score (SNAP option -S). The tool uses the default of 1 (or a 0 setting) to indicate that all reads should be worked with." /> + <param name="filter_output" type="select" label="filter output (default: no filtering)" help="filter output (SNAP option -F) for certain classes of reads."> + <option value="off">no filtering</option> + <option value="a">aligned only</option> + <option value="s">single-aligned only</option> + <option value="u">unaligned only</option> + </param> + <param name="sort" type="select" label="output sorting (default: sort by read coordinates)" help="Sort the output file by alignment location (SNAP option --so)."> + <option value="0">sort by read coordinates</option> + <option value="off">no sorting</option> + </param> + <param name="mmatch_notation" type="select" label="CIGAR symbols for alignment matches/mismatches (default: M notation)" help="Indicates whether CIGAR strings in the generated SAM/BAM file should use M (alignment match) rather than = and X (sequence (mis-)match). Warning: Downstream variant calling based on samtools currently relies on the old-style M notation!!" > + <option value="general">use M for both matches and mismatches</option> + <option value="differentiate">use = for matches, X for mismatches</option> + </param> + </when> + </conditional> +</inputs> + +<outputs> + <data name="outputfile" format="bam" label="Aligned reads from MiModd ${tool.name} on ${on_string}"> + <change_format> + <when input="oformat" value="sam" format="sam"/> + </change_format> + </data> +</outputs> + +<help> +.. 
class:: infomark + + **What it does** + +The tool aligns the sequenced reads in an arbitrary number of input datasets against a common reference genome and stores the results in a single, possibly multi-sample output file. It supports a variety of different sequenced reads input formats, i.e., SAM, BAM, fastq and gzipped fastq, and both single-end and paired-end data. + +Internally, the tool uses the ultrafast, hashtable-based aligner SNAP (http://snap.cs.berkeley.edu), hence its name. + +**Notes:** + +1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to align gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. + +2) To use paired-end fastq data with the tool the read mate information needs to be split over two fastq files in corresponding order. + + **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. + +3) The tool supports the alignment of reads from the same sequencing run, but distributed across several input files. + + Generally, it expects the reads from each input dataset to belong to one read-group and will abort with an error message if any input dataset declares more than one read-group or sample name in its header. Different datasets, however, are allowed to contain reads from the same read-group (as indicated by matching read-group IDs and sample names in their headers), in which case the reads will be combined into one group in the output. + +4) Read-group information is required for every input dataset! + + We generally recommend storing NGS datasets in SAM/BAM format with run metadata stored in the file header. 
You can use the *NGS Run Annotation* and *Convert* tools to convert data in fastq format to SAM/BAM with added run information. + + While it is not our recommended approach, you can, if you prefer it, align reads from fastq files or SAM/BAM files without header read-group information. To do so, you **must** specify a SAM file that provides the missing information in its header along with the input dataset. You can generate a SAM header file with the *NGS Run Annotation* tool. + + Optionally, a SAM header file can also be used to replace existing read-group information in a headered SAM/BAM input file. This can be used to resolve read-group ID conflicts between multiple input files at tool runtime. + +5) Currently, you cannot configure aligner-specific options separately for specific input files from within this Galaxy tool. If you need this advanced level of control, you should use the command line tool ``mimodd snap-batch``. + +.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 +.. _recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy +.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest + +</help> +</tool> +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/snp_caller_caller.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,62 @@ +<tool id="variant_calling" name="Variant Calling"> + <description>From a reference and aligned reads generate a BCF file with position-specific variant likelihoods and coverage information</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd varcall + + "$ref_genome" + #for $l in $list_input + "${l.inputfile}" + #end for + --ofile "$output_vcf" + --depth "$depth" + $group_by_id + $no_md5_check + --verbose + --quiet + </command> + + <inputs> + <param name="ref_genome" type="data" format="fasta" label="reference genome" /> + <repeat name="list_input" title="Aligned reads input source" default="1" min="1"> + <param name="inputfile" type="data" format="bam" label="input file" /> + </repeat> + <param name="group_by_id" type="boolean" label="group reads based on read group id only" truevalue="-i" falsevalue="" checked="true" help="If selected, this option ensures that only the read group id (but not the sample name) is considered in grouping reads in the input file(s). If turned off, read groups with identical sample names are automatically pooled and analyzed together even if they come from different NGS runs." /> + <param name="no_md5_check" type="boolean" label="turn off md5 sum verification" truevalue="-x" falsevalue="" checked="false" help="leave turned on to avoid accidental variant calling against a wrong reference genome version (see the tool help below)." /> + <param name="depth" type="integer" value="250" label="maximum per-BAM depth (default: 250)" help="to avoid excessive use of memory"/> + </inputs> + + <outputs> + <data name="output_vcf" format="bcf" label="Variant Calls from MiModd Variant Calling on ${on_string}"/> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool transforms the read-centered information of its aligned reads input files into position-centered information. 
+ +**It produces a BCF file that serves as the basis for all further variant analyses with MiModD**. + +**Notes:** + +By default, the tool will check whether the input BAM file(s) provide(s) MD5 checksums for the reference genome sequences used during read alignment (the *SNAP Read Alignment* tool stores these in the BAM file header). If it finds MD5 sums for all sequences, it will compare them to the actual checksums of the sequences in the specified reference genome and +check that every sequence mentioned in any BAM input file has a counterpart with matching MD5 sum in the reference genome; it aborts with an error message if that is not the case. If it finds sequences with matching checksum, but different names in the reference genome, it will use the name from the reference genome file in its output. + +This behavior has two benefits: + +1) It protects against accidental variant calling against a wrong reference genome (i.e., a different one than that used during the alignment step), which would result in wrong calls. This is the primary reason why we recommend leaving the check activated. + +2) It provides an opportunity to change sequence names between aligned reads files and variant call files by providing a reference genome file with altered sequence names (but identical sequence data). + +Since there may be rare cases where you *really* want to align against a reference genome with different checksums (e.g., you may have edited the reference sequence based on the alignment results), the check can be turned off, but only do this if you know exactly why. + +----------- + +Internally, the tool uses samtools mpileup combined with bcftools to do all per-nucleotide calculations. + +It exposes just a single configuration parameter of these tools - the *maximum per-BAM depth*. Through this parameter, the maximum number of reads considered for variant calling at any site can be controlled. Its default value of 250 is taken from *samtools mpileup* and is usually suitable. 
Consider, however, that this gives the maximum read number per input file, so if you have a large number of samples in one input file, it could become necessary to increase the value to get sufficient reads considered per sample. + +</help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/snpeff_genomes.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,20 @@ +<tool id="snpeff_genomes" name="List Installed SnpEff Genomes"> + <description>Checks the local SnpEff installation to compile a list of currently installed genomes</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd snpeff-genomes -o "$outputfile" + </command> + <outputs> + <data name="outputfile" format="tabular" /> + </outputs> +<help> +.. class:: infomark + +**What it does** + +When executed this tool searches the host machine's SnpEff installation for properly registered and installed +genome annotation files. The resulting list is added as a plain text file to your history for use with the *Variant Annotation* Tool. + +</help> + +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool_dependencies.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,75 @@ +<?xml version="1.0"?> +<tool_dependency> + <package name="zlib" version="1.2.8"> + <repository changeset_revision="dce22a65bac2" name="package_zlib_1_2_8" owner="wolma" prior_installation_required="True" toolshed="https://toolshed.g2.bx.psu.edu" /> + </package> + <package name="python3" version="3.4.1"> + <repository changeset_revision="1c337560fa56" name="package_python3_zlib_dependent_1_0" owner="wolma" prior_installation_required="True" toolshed="https://toolshed.g2.bx.psu.edu" /> + </package> + + <package name="mimodd" version="0.1.5.2"> + <install version="1.0"> + <actions> + <action type="download_by_url">http://sourceforge.net/projects/mimodd/files/MiModD-0.1.5.2.tar.gz</action> + <action type="set_environment_for_install"> + <repository changeset_revision="1c337560fa56" name="package_python3_zlib_dependent_1_0" owner="wolma" toolshed="https://toolshed.g2.bx.psu.edu"> + <package name="python3" version="3.4.1" /> + </repository> + </action> + <action type="set_environment_for_install"> + <repository changeset_revision="dce22a65bac2" name="package_zlib_1_2_8" owner="wolma" toolshed="https://toolshed.g2.bx.psu.edu"> + <package name="zlib" version="1.2.8" /> + </repository> + </action> + <action type="shell_command">pyvenv --without-pip $INSTALL_DIR/MiModD_venv</action> + <!-- remove the plain python symlink from the venv to avoid its + accidental use by Galaxy, MiModD uses python3 explicitly --> + <action type="shell_command">rm $INSTALL_DIR/MiModD_venv/bin/python</action> + <!-- install MiModD placing the entry script mimodd into the venv's bin directory --> + <action type="shell_command">$INSTALL_DIR/MiModD_venv/bin/python3 setup.py install</action> + <!-- make MiModD's wrapped binaries executable --> + <action type="shell_command">chmod 755 $INSTALL_DIR/MiModD_venv/lib/python3.4/site-packages/MiModD/bin/*</action> + + + <action 
type="set_environment"> + <!-- make the mimodd entry script discoverable --> + <environment_variable action="prepend_to" name="PATH">$INSTALL_DIR/MiModD_venv/bin</environment_variable> + <!-- clear $PYTHONPATH and $PYTHONHOME --> + <environment_variable action="set_to" name="PYTHONPATH" /> + <environment_variable action="set_to" name="PYTHONHOME" /> + <!-- propagate $LD_LIBRARY_PATH --> + <environment_variable action="prepend_to" name="LD_LIBRARY_PATH">$ENV[LD_LIBRARY_PATH]</environment_variable> + </action> + + + </actions> + </install> + <readme> +Summary: Tools for Mutation Identification in Model Organism Genomes using Desktop PCs +Home-page: http://sourceforge.net/projects/mimodd/ +Author: Wolfgang Maier +Author-email: wolfgang.maier@biologie.uni-freiburg.de +License: GPL +Download-URL: http://sourceforge.net/projects/mimodd/ + +MiModD - Identify Mutations from Whole-Genome Sequencing Data +************************************************************* + +MiModD is an integrated solution for efficient and user-friendly analysis of +whole-genome sequencing (WGS) data from laboratory model organisms. +It enables geneticists to identify the genetic mutations present in an organism +starting from just raw WGS read data and a reference genome without the help of +a trained bioinformatician. + +MiModD is designed for good performance on standard hardware and enables WGS +data analysis for most model organisms on regular desktop PCs. + +MiModD can be installed under Linux and Mac OS with minimal software +requirements and a simple setup procedure. As a standalone package it can be +used from the command line, but can also be integrated seamlessly and easily +into any local installation of a Galaxy bioinformatics server providing a +graphical user interface, database management of results and simple composition +of analysis steps into workflows. + </readme> + </package> +</tool_dependency>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/varextract.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,97 @@ +<tool id="extract_variants" name="Extract Variant Sites"> + <description>from a BCF file</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd varextract "$ifile" + #if $len($sitesinfo) + -p + #for $source in $sitesinfo + "${source.pre_vcf}" + #end for + #end if + --ofile "$output_vcf" + $keep_alts + --verbose + </command> + + <inputs> + <param name="ifile" type="data" format="bcf" label="BCF input file" help="Use the Variant Calling tool to generate the input for this tool."/> + <repeat name="sitesinfo" title="include information from pre-calculated vcf file" default="0"> + <param name="pre_vcf" type="data" format="vcf" label="independently generated vcf file" /> + </repeat> + <param name="keep_alts" type="boolean" label="keep all sites with alternate bases" truevalue="-a" falsevalue="" checked="false" help="If selected, the VCF output will include ALL sites for which non-reference bases have been observed, i.e., even those not considered allelic sites by the variant caller." /> + </inputs> + <outputs> + <data name="output_vcf" format="vcf" label="Variants extracted with MiModd from ${on_string}"/> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool takes as input a BCF file like the ones produced by the *Variant Calling* tool, extracts just the variant sites from it and reports them in VCF format. + +If the BCF input file specifies multiple samples, sites are included if they qualify as variant sites in at least one sample. + +In a typical analysis workflow, you will use the tool's VCF output as input for the *VCF Filter* tool to cut down the often still impressive list of sites to a subset with relevance to your project. 
+ +**Options:** + +1) By default, a variant site is considered to be a position in the genome for which a non-reference allele appears in the inferred genotype of any sample. + + You can select the *keep all sites with alternate bases* option if you instead want to extract all sites for which at least one non-reference base has been observed (whether resulting in a non-reference allele call or not). Using this option should rarely be necessary, but could occasionally be helpful for closer inspection of candidate genomic regions. + +2) During the process of variant extraction the tool can take into account genome positions specified in one or more independently generated VCF files. If such additional VCF input is provided, the tool output will contain the samples found in these files as additional samples and sites from the main BCF file will be included if they either qualify as variant sites in at least one sample specified in the BCF or if they are listed in any of the additional VCF files. + + Optional VCF input can be particularly useful in one of the following situations: + + *scenario i* - you have prior information that leads you to think that certain genome positions are of special relevance for your project and, thus, you are interested in the statistics produced by the variant caller for these positions even if they are not considered variant sites. In this case you can use a minimal VCF file to guide the variant extraction process to include these positions. This minimal VCF file needs a minimal header: + + ``##fileformat=VCFv4.2`` + + followed by positional information like in this example:: + + #CHROM POS ID REF ALT QUAL FILTER INFO + chrI 1222 . . . . . . + chrI 2651 . . . . . . + chrI 3659 . . . . . . + chrI 3731 . . . . . . + + , where columns are tab-separated and . serves as a placeholder for missing information. 
+ + *scenario ii* - you have actual variant calls from an additional sample, but you do not have access to the original sequenced reads data (if you had, the recommended approach would be to align this data along with your other sequencing data or, at least, to perform the *Variant Calling* step together). + + This situation is often encountered with published datasets. Assume you have obtained a list of known single nucleotide variants (SNVs) found in one particular strain of your favorite model organism and you would like to know which of these SNVs are present in the related strains you have sequenced. You have aligned the sequenced reads from your samples and have used the *Variant Calling* tool, which has generated a BCF file ready for variant extraction. If the SNV list for the previously sequenced strain is in VCF format already, you can now just plug it into the analysis process by specifying it in the tool interface as an *independently generated vcf file*. The resulting vcf output file will contain all SNV sites along with the variant sites found in the BCF alone. You can then proceed to the *VCF Filter* tool to look at the original SNV sites only or to investigate any other interesting subset of sites. If the SNV list is in some other format, you will have to convert it to VCF first. At a minimum, the file must have a ``##fileformat`` header line like the previous example and have the ``REF`` and ``ALT`` columns filled in like so:: + + #CHROM POS ID REF ALT QUAL FILTER INFO + chrI 1897409 . A G . . . + chrI 1897492 . C T . . . + chrI 1897616 . C A . . . + chrI 1897987 . A T . . . + chrI 1898185 . C T . . . + chrI 1898715 . G A . . . + chrI 1898729 . T C . . . + chrI 1900288 . T A . . . + + , in which case the tool will assume that the corresponding sample is homozygous for each of the SNVs. 
If you need to distinguish between homozygous and heterozygous SNVs you will have to extend the format to include a format and a sample column with genotype (GT) information like in this example:: + + #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sampleX + chrI 1897409 . A G . . . GT 1/1 + chrI 1897492 . C T . . . GT 0/1 + chrI 1897616 . C A . . . GT 0/1 + chrI 1897987 . A T . . . GT 0/1 + chrI 1898185 . C T . . . GT 0/1 + chrI 1898715 . G A . . . GT 0/1 + chrI 1898729 . T C . . . GT 0/1 + chrI 1900288 . T A . . . GT 0/1 + + , in which sampleX would be heterozygous for all SNVs except the first. + + .. class:: warningmark + + If the optional VCF input contains INDEL calls, these will be ignored by the tool. + + +</help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/vcf_filter.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,124 @@ +<tool id="vcf_filter" name="VCF Filter"> + <description>Extracts lines from a vcf variant file based on field-specific filters</description> + <version_command>mimodd version -q</version_command> + <command> + mimodd vcf-filter + "$inputfile" + -o "$outputfile" + #if len($datasets): + -s + #for $i in $datasets + "$i.sample" + #end for + --gt + #for $i in $datasets + ## remove whitespace from free-text input + "#echo ("".join($i.GT.split()) or "ANY")#" + #echo " " + #end for + --dp + #for $i in $datasets + "$i.DP" + #end for + --gq + #for $i in $datasets + "$i.GQ" + #end for + #end if + #if len($regions): + -r + #for $i in $regions + #if $i.stop: + "$i.chrom:$i.start-$i.stop" + #else: + "$i.chrom:$i.start" + #end if + #end for + #end if + #if $vfilter: + --vfilter + ## remove ',' (and possibly adjacent whitespace) and replace with ' ' + "#echo ('" "'.join($vfilter.split(',')))#" + #end if + $vartype + </command> + + <inputs> + <param name="inputfile" type="data" format="vcf" label="VCF input file" /> + <repeat name="datasets" title="Sample-specific Filter" default="0" min="0"> + <param name="sample" type="text" label="sample" help="name of a sample as it appears in the VCF input file and that indicates the sample that this filter should be applied to." /> + <param name="GT" type="text" label="genotype pattern(s) for the inclusion of variants" help="keep only variants for which the genotype of the sample matches the specified pattern; format: x/x where x = 0 is wildtype and x = 1 is mutant. Multiple genotypes can be specified as a comma-separated list." 
/> + <param name="DP" type="integer" label="depth of coverage for the sample at the variant site" value = "0" help="keep only variants with at least this sample-specific coverage at the variant site" /> + <param name="GQ" type="integer" label="genotype quality for the variant in the sample" value = "0" help="keep only variants for which the genotype prediction for the sample has at least this quality" /> + </repeat> + <repeat name="regions" title="Region Filter" default="0" min="0" help = "Filter variant sites by their position in the genome. If multiple Region Filters are specified, all variants that fall in ONE of the regions are reported."> + <param name="chrom" type="text" label="Chromosome" /> + <param name="start" type="text" label="Region Start" /> + <param name="stop" type="text" label="Region End" /> + </repeat> + <param name="vartype" type="select" label="Select the types of variants to include in the output"> + <option value="">all types of variants</option> + <option value="--no-indels">exclude indels</option> + <option value="--indels-only">only indels</option> + </param> + <param name="vfilter" type="text" label="sample" help="Filter output by sample name; only the sample-specific columns with their sample name matching any of the comma separated filters will be retained in the output." /> + </inputs> + + <outputs> + <data name="outputfile" format="vcf" /> + </outputs> + + <help> +.. class:: infomark + + **What it does** + +The tool filters a variant file in VCF format to generate a new VCF file with only a subset of the original variants. + +The following types of variant filters can be set up: + +1) Sample-specific filters: + + Filter variants based on their characteristics in the sequenced reads of a specific sample. Multiple sample-specific filters are combined by logical AND, i.e., only variants that pass ALL sample-specific filters are kept. + +2) Region filters: + + Filter variants based on the genomic region they affect. 
Multiple region filters are combined by logical OR, i.e., variants passing ANY region filter are kept. + +3) Variant type filter: + + Filter variants by their type, i.e., whether they are single nucleotide variations (SNVs) or indels. + +In addition, the *sample* filter can be used to reduce the samples encoded in a multi-sample VCF file to just those specified by the filter. +The *sample* filter is included mainly for compatibility reasons: if an external tool cannot deal with the multi-sample file format, but instead looks only at the first sample-specific column of the file, you can use the filter to turn the multi-sample file into a single-sample file. Besides, the filter can also be used to change the order of the samples since it will sort the samples in the order specified in the filter field. + +**Examples of sample-specific filters:** + +*Simple genotype pattern* + +genotype pattern: 1/1 ==> keep all variants in the vcf input file for which the specified sample's genotype is homozygous mutant + +*Complex genotype pattern* + +genotype pattern: 0/1, 0/0 ==> keep all variants for which the sample's genotype is either heterozygous or homozygous wildtype + +*Multiple sample-specific filters* + +Filter 1: genotype pattern: 0/0, Filter 2: genotype pattern 1/1: +==> keep all variants for which the first sample's genotype is homozygous wildtype **and** the second sample's genotype is homozygous mutant + +*Combining sample-specific filter criteria* + +genotype pattern: 1/1, depth of coverage: 3, genotype quality: 9 +==> keep variants for which the sample's genotype is homozygous mutant **and** for which this genotype assignment is corroborated by a genotype quality score of at least 9 +**and** at least three reads from the sample cover the variant site + +**TIP:** + +As in the example above, genotype quality is typically most useful in combination with a genotype pattern. +It then acts, effectively, to make the genotype filter more stringent. + + + + </help> +</tool>