Mercurial > repos > peterjc > fasta_filter_by_id
annotate tools/fasta_tools/fasta_filter_by_id.py @ 1:5cd569750e85
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
author | peterjc |
---|---|
date | Tue, 07 Jun 2011 17:22:48 -0400 |
parents | 2e5f8ad1a096 |
children | 5b552b3005f2 |
rev | line source |
---|---|
0
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
1 #!/usr/bin/env python |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
2 """Filter a FASTA file with IDs from a tabular file, e.g. from BLAST. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
3 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
4 Takes five command line options, tabular filename, ID column numbers |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
5 (comma separated list using one based counting), input FASTA filename, and |
1
5cd569750e85
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
6 two output FASTA filenames (for records with and without the given IDs). |
5cd569750e85
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
7 |
5cd569750e85
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
8 If either output filename is just a minus sign, that file is not created. |
0
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
9 This is intended to allow output for just the matched (or just the non-matched) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
10 records. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
11 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
12 Note in the default NCBI BLAST+ tabular output, the query sequence ID is |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
13 in column one, and the ID of the match from the database is in column two. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
14 Here sensible values for the column numbers would therefore be "1" or "2". |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
15 """ |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
16 import sys |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
17 from galaxy_utils.sequence.fasta import fastaReader, fastaWriter |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
18 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
19 def stop_err( msg ): |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
20 sys.stderr.write( msg ) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
21 sys.exit() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
22 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
23 #Parse Command Line |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
24 try: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
25 tabular_file, cols_arg, in_file, out_positive_file, out_negative_file = sys.argv[1:] |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
26 except ValueError: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
27 stop_err("Expected five arguments, got %i:\n%s" % (len(sys.argv)-1, " ".join(sys.argv))) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
28 try: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
29 columns = [int(arg)-1 for arg in cols_arg.split(",")] |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
30 except ValueError: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
31 stop_err("Expected list of columns (comma separated integers), got %s" % cols_arg) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
32 |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
33 #Read tabular file and record all specified identifiers |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
34 ids = set() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
35 handle = open(tabular_file, "rU") |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
36 if len(columns)>1: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
37 #General case of many columns |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
38 for line in handle: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
39 if line.startswith("#"): |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
40 #Ignore comments |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
41 continue |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
42 parts = line.rstrip("\n").split("\t") |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
43 for col in columns: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
44 ids.add(parts[col]) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
45 print "Using %i IDs from %i columns of tabular file" % (len(ids), len(columns)) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
46 else: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
47 #Single column, special case speed up |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
48 col = columns[0] |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
49 for line in handle: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
50 if not line.startswith("#"): |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
51 ids.add(line.rstrip("\n").split("\t")[col]) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
52 print "Using %i IDs from tabular file" % (len(ids)) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
53 handle.close() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
54 |
1
5cd569750e85
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
55 #Write filtered FASTA file based on IDs from tabular file |
0
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
56 reader = fastaReader(open(in_file, "rU")) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
57 if out_positive_file != "-" and out_negative_file != "-": |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
58 print "Generating two FASTA files" |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
59 positive_writer = fastaWriter(open(out_positive_file, "w")) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
60 negative_writer = fastaWriter(open(out_negative_file, "w")) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
61 for record in reader: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
62 #The [1:] is because the fastaReader leaves the > on the identifer. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
63 if record.identifier and record.identifier.split()[0][1:] in ids: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
64 positive_writer.write(record) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
65 else: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
66 negative_writer.write(record) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
67 positive_writer.close() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
68 negative_writer.close() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
69 elif out_positive_file != "-": |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
70 print "Generating matching FASTA file" |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
71 positive_writer = fastaWriter(open(out_positive_file, "w")) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
72 for record in reader: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
73 #The [1:] is because the fastaReader leaves the > on the identifer. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
74 if record.identifier and record.identifier.split()[0][1:] in ids: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
75 positive_writer.write(record) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
76 positive_writer.close() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
77 elif out_negative_file != "-": |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
78 print "Generating non-matching FASTA file" |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
79 negative_writer = fastaWriter(open(out_negative_file, "w")) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
80 for record in reader: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
81 #The [1:] is because the fastaReader leaves the > on the identifer. |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
82 if not record.identifier or record.identifier.split()[0][1:] not in ids: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
83 negative_writer.write(record) |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
84 negative_writer.close() |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
85 else: |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
86 stop_err("Neither output file requested") |
2e5f8ad1a096
Migrated tool version 0.0.2 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
87 reader.close() |