view tools/seq_length/seq_length.xml @ 4:17caf7a7c2c5 draft

Bump Biopython dependency
author peterjc
date Thu, 30 Nov 2023 09:49:28 +0000
parents fcdf11fb34de

<tool id="seq_length" name="Sequence lengths" version="0.0.5">
    <description>from FASTA, QUAL, FASTQ, or SFF file</description>
    <requirements>
        <requirement type="package" version="1.81">biopython</requirement>
    </requirements>
    <version_command>
python $__tool_directory__/seq_length.py --version
    </version_command>
    <command detect_errors="aggressive">
python $__tool_directory__/seq_length.py -i '$input_file' -f '$input_file.ext' -o '$output_file'
#if $stats
  -s
#end if
    </command>
    <inputs>
        <param name="input_file" type="data" format="fasta,qual,fastq,sff" label="Sequence file" help="FASTA, QUAL, FASTQ, or SFF format." />
        <param name="stats" type="boolean" label="Compute additional statistics (median, N50)" />
    </inputs>
    <outputs>
        <data name="output_file" format="tabular" label="${on_string} length"/>
    </outputs>
    <tests>
        <test>
            <param name="input_file" value="four_human_proteins.fasta" ftype="fasta" />
            <output name="output_file" file="four_human_proteins.length.tabular" ftype="tabular" />
            <assert_stdout>
                <has_line line="4 sequences, total length 3297, mean 824.2" />
                <has_line line="Shortest 348, longest 1382" />
            </assert_stdout>
        </test>
        <test>
            <param name="input_file" value="SRR639755_sample_strict.fastq" ftype="fastq" />
            <output name="output_file" file="SRR639755_sample_strict.length.tabular" ftype="tabular" />
            <assert_stdout>
                <has_line line="2 sequences, total length 202, mean 101.0" />
                <has_line line="Shortest 101, longest 101" />
            </assert_stdout>
        </test>
        <test>
            <param name="input_file" value="MID4_GLZRM4E04_rnd30.sff" ftype="sff" />
            <param name="stats" value="true" />
            <output name="output_file" file="MID4_GLZRM4E04_rnd30.length.tabular" ftype="tabular" />
            <assert_stdout>
                <has_line line="30 sequences, total length 7504, mean 250.1" />
                <has_line line="Shortest 42, longest 473" />
                <has_line line="Median length 269.5, N50 345" />
            </assert_stdout>
        </test>
    </tests>
    <help>
**What it does**

Takes a FASTA, QUAL, FASTQ or Standard Flowgram Format (SFF) file and produces a
two-column tabular file with one line per sequence, giving the sequence
identifier and that sequence's length.
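
As a rough illustration of the two-column output, the sketch below writes one
identifier/length line per record. It is not the tool's own code: the names
``write_lengths`` and ``FakeRecord`` and the file name ``example.fasta`` are
hypothetical, and ``FakeRecord`` only stands in for a Biopython ``SeqRecord`` so
that the example runs without Biopython installed.

```python
import io


def write_lengths(records, handle):
    """Write one 'identifier TAB length' line per record; return the count."""
    count = 0
    for record in records:
        handle.write(f"{record.id}\t{len(record)}\n")
        count += 1
    return count


class FakeRecord:
    """Hypothetical stand-in for a Biopython SeqRecord (has .id and a length)."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq

    def __len__(self):
        return len(self.seq)


out = io.StringIO()
write_lengths([FakeRecord("seq1", "ACGT"), FakeRecord("seq2", "AC")], out)
table = out.getvalue()

# With Biopython installed, real records could be passed in instead, e.g.:
#   from Bio import SeqIO
#   write_lengths(SeqIO.parse("example.fasta", "fasta"), out)
```

Taking any iterable of records keeps the sketch independent of the input
format, since the format only matters when the records are parsed.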

Additionally, the tool will report some basic statistics about the sequences
(visible via the output file's metadata, or the stdout log for the job),
namely the number of sequences, total length, mean length, minimum length and
maximum length.

You can optionally request additional statistics, namely the median and N50.
Computing these uses more RAM and takes fractionally longer.
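
As a rough illustration of these optional statistics, the sketch below computes
the median and N50 for a list of lengths. It assumes the common definition of
N50 (the sequence length at which the running total, taken from longest to
shortest, first reaches half of the total length). The function name ``n50`` is
hypothetical, and this is not necessarily how the tool itself computes it.

```python
from statistics import median


def n50(lengths):
    """Return the N50 of a list of sequence lengths (common definition)."""
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down, accumulating total length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


lengths = [2, 3, 4, 5, 6]
print(median(lengths))  # 4
print(n50(lengths))     # 5, since 6 + 5 = 11 reaches half of the total (10)
```

Both statistics need the full list of lengths rather than running totals, which
is consistent with the extra memory use noted above.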

WARNING: If there are any duplicate sequence identifiers, every occurrence will
appear as its own line in the tabular output.

For SFF files, the tool reports the trimmed lengths of the reads.

**References**

This tool uses Biopython's ``SeqIO`` library to read sequences, so please cite
the Biopython application note (and Galaxy too of course):

Cock et al. (2009). Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

This tool is available to install into other Galaxy Instances via the Galaxy
Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/seq_length
    </help>
    <citations>
        <citation type="doi">10.1093/bioinformatics/btp163</citation>
    </citations>
</tool>