annotate tools/fasta_tools/fasta_filter_by_id.py @ 1:5cd569750e85

Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
author peterjc
date Tue, 07 Jun 2011 17:22:48 -0400
parents 2e5f8ad1a096
children 5b552b3005f2
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
2 """Filter a FASTA file with IDs from a tabular file, e.g. from BLAST.
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
3
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
4 Takes five command line options, tabular filename, ID column numbers
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
5 (comma separated list using one based counting), input FASTA filename, and
1
5cd569750e85 Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents: 0
diff changeset
6 two output FASTA filenames (for records with and without the given IDs).
5cd569750e85 Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents: 0
diff changeset
7
5cd569750e85 Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents: 0
diff changeset
8 If either output filename is just a minus sign, that file is not created.
0
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
9 This is intended to allow output for just the matched (or just the non-matched)
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
10 records.
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
11
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
12 Note in the default NCBI BLAST+ tabular output, the query sequence ID is
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
13 in column one, and the ID of the match from the database is in column two.
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
14 Here sensible values for the column numbers would therefore be "1" or "2".
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
15 """
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
16 import sys
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
17 from galaxy_utils.sequence.fasta import fastaReader, fastaWriter
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
18
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
19 def stop_err( msg ):
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
20 sys.stderr.write( msg )
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
21 sys.exit()
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
22
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
23 #Parse Command Line
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
24 try:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
25 tabular_file, cols_arg, in_file, out_positive_file, out_negative_file = sys.argv[1:]
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
26 except ValueError:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
27 stop_err("Expected five arguments, got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv)))
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
28 try:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
29 columns = [int(arg)-1 for arg in cols_arg.split(",")]
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
30 except ValueError:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
31 stop_err("Expected list of columns (comma separated integers), got %s" % cols_arg)
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
32
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
33 #Read tabular file and record all specified identifiers
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
34 ids = set()
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
35 handle = open(tabular_file, "rU")
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
36 if len(columns)>1:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
37 #General case of many columns
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
38 for line in handle:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
39 if line.startswith("#"):
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
40 #Ignore comments
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
41 continue
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
42 parts = line.rstrip("\n").split("\t")
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
43 for col in columns:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
44 ids.add(parts[col])
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
45 print "Using %i IDs from %i columns of tabular file" % (len(ids), len(columns))
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
46 else:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
47 #Single column, special case speed up
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
48 col = columns[0]
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
49 for line in handle:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
50 if not line.startswith("#"):
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
51 ids.add(line.rstrip("\n").split("\t")[col])
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
52 print "Using %i IDs from tabular file" % (len(ids))
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
53 handle.close()
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
54
1
5cd569750e85 Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents: 0
diff changeset
55 #Write filtered FASTA file based on IDs from tabular file
0
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
56 reader = fastaReader(open(in_file, "rU"))
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
57 if out_positive_file != "-" and out_negative_file != "-":
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
58 print "Generating two FASTA files"
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
59 positive_writer = fastaWriter(open(out_positive_file, "w"))
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
60 negative_writer = fastaWriter(open(out_negative_file, "w"))
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
61 for record in reader:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
62 #The [1:] is because the fastaReader leaves the > on the identifer.
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
63 if record.identifier and record.identifier.split()[0][1:] in ids:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
64 positive_writer.write(record)
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
65 else:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
66 negative_writer.write(record)
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
67 positive_writer.close()
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
68 negative_writer.close()
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
69 elif out_positive_file != "-":
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
70 print "Generating matching FASTA file"
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
71 positive_writer = fastaWriter(open(out_positive_file, "w"))
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
72 for record in reader:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
73 #The [1:] is because the fastaReader leaves the > on the identifer.
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
74 if record.identifier and record.identifier.split()[0][1:] in ids:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
75 positive_writer.write(record)
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
76 positive_writer.close()
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
77 elif out_negative_file != "-":
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
78 print "Generating non-matching FASTA file"
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
79 negative_writer = fastaWriter(open(out_negative_file, "w"))
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
80 for record in reader:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
81 #The [1:] is because the fastaReader leaves the > on the identifer.
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
82 if not record.identifier or record.identifier.split()[0][1:] not in ids:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
83 negative_writer.write(record)
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
84 negative_writer.close()
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
85 else:
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
86 stop_err("Neither output file requested")
2e5f8ad1a096 Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
87 reader.close()