<!--
# =====================================================
# $Id: ExtractSeqsFromFasta.xml 90 2011-01-19 13:20:31Z pieter.neerincx@gmail.com $
# $URL: https://trac.nbic.nl/svn/galaxytools/trunk/tools/general/FastaTools/ExtractSeqsFromFasta.xml $
# $LastChangedDate: 2011-01-19 07:20:31 -0600 (Wed, 19 Jan 2011) $
# $LastChangedRevision: 90 $
# $LastChangedBy: pieter.neerincx@gmail.com $
# =====================================================
-->
<tool id="ExtractSeqsFromFasta1" version="1.1" name="ExtractSeqsFromFasta">
  <description>Extract sequences from a FASTA file based on a list of IDs</description>
  <command interpreter="perl">ExtractSeqsFromFasta.pl $ignore_accession_number_versions -f $identifiers -i $input -o $output -l WARN</command>
  <inputs>
    <param format="fasta" name="input" type="data" label="FASTA sequences"/>
    <param format="txt" name="identifiers" type="data" label="List of IDs to extract sequences for"/>
    <param name="ignore_accession_number_versions" type="boolean" truevalue="-u" falsevalue="" optional="true" label="Ignore accession number versions"/>
  </inputs>
  <outputs>
    <data format="fasta" name="output" label="FASTA sequences for ${identifiers.name}"/>
  </outputs>
<!--
  <tests>
    <test>
      <param name="input" value="*.fasta"/>
      <param name="identifiers" value="*.txt"/>
      <output name="output" file="*.fasta"/>
    </test>
  </tests>
-->
  <help>

.. class:: infomark

**What it does**

This tool filters a set of FASTA sequences for certain identifiers (IDs) or accession numbers. \
Only sequences whose ID or accession number is present in the supplied list will remain in the filtered FASTA output. \
The list of IDs or accession numbers to filter for must be a flat text file with one ID or accession number per line.

This tool can match IDs with and without colon-prefixed database namespaces in the FASTA sequence header line. \
Hence your FASTA header can contain either >UniProtKB:Q86Y46 ... or just plain >Q86Y46 ... . \
Database namespace prefixes should not be present in the list of IDs that you want to extract sequences for.

FASTA headers may contain multiple IDs separated by pipe symbols (|) or semicolons (;). \
If multiple IDs are supplied, they should not contain any whitespace, as everything after the \
first whitespace is considered to be the (optional) description, which will not be matched against the list \
of IDs to extract.

If your FASTA file contains versioned IDs / accessions, your list of IDs / accessions to extract must also contain \
versioned IDs / accessions and the version numbers must match, unless *Ignore accession number versions* is enabled.

-----

**Example**

If the FASTA header is this::

  >IPI:CON_IPI00174775.2|TREMBL:Q32MB2;Q86Y46 Tax_Id=9606 Gene_Symbol=KRT73 Keratin-73

The following IDs / accession numbers will match this sequence header::

  CON_IPI00174775.2
  Q32MB2
  Q86Y46

These will not match::

  IPI:CON_IPI00174775.2   (prefix should be removed)
  KRT73                   (part of the description, not of the list of IDs,
                           which is everything up to the first whitespace)

And finally these will not match unless *Ignore accession number versions* is enabled::

  CON_IPI00174775         (no version number, while the FASTA file contains versioned accession numbers)
  CON_IPI00174775.1       (wrong version number)
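
The matching rules illustrated above can be sketched in Python as follows. This is only an illustration and not the code of ExtractSeqsFromFasta.pl: the function names ``header_ids`` and ``matches`` are hypothetical, and treating a version as a trailing dot followed by digits is an assumption.

```python
import re


def header_ids(header, ignore_versions=False):
    """Return the set of IDs found in one FASTA header line (hypothetical helper)."""
    # Everything after the first whitespace is the description and is ignored.
    fields = header.lstrip(">").split(None, 1)
    id_part = fields[0] if fields else ""
    ids = set()
    # Multiple IDs are separated by pipe symbols or semicolons.
    for token in re.split(r"[|;]", id_part):
        if not token:
            continue
        # Drop an optional colon-prefixed database namespace such as "IPI:".
        accession = token.split(":", 1)[-1]
        if ignore_versions:
            # Assumption: a version is a trailing ".digits" suffix.
            accession = re.sub(r"\.\d+$", "", accession)
        ids.add(accession)
    return ids


def matches(header, wanted, ignore_versions=False):
    """Return True if any ID in the header occurs in the wanted set."""
    if ignore_versions:
        # Strip versions on both sides so that only the base accession counts.
        wanted = {re.sub(r"\.\d+$", "", w) for w in wanted}
    return not header_ids(header, ignore_versions).isdisjoint(wanted)


header = ">IPI:CON_IPI00174775.2|TREMBL:Q32MB2;Q86Y46 Tax_Id=9606 Gene_Symbol=KRT73 Keratin-73"
print(matches(header, {"Q86Y46"}))                                 # True
print(matches(header, {"KRT73"}))                                  # False
print(matches(header, {"CON_IPI00174775.1"}))                      # False
print(matches(header, {"CON_IPI00174775.1"}, ignore_versions=True))  # True
```

With ``ignore_versions`` left at its default, the sketch requires an exact match including the version number, which mirrors the behaviour described for the unchecked *Ignore accession number versions* option.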
  </help>
</tool>