annotate iscc_similarity_parse_output.py @ 1:7d2c95a58897 draft default tip

planemo upload for repository https://github.com/BMCV/galaxy-image-analysis/tree/master/tools/iscc-sum commit 6db86b8b65a0e05b7f3541d505fbe900633fc72a
author imgteam
date Fri, 19 Dec 2025 15:03:29 +0000
parents
children
#!/usr/bin/env python
"""
Parse ISCC similarity output into tabular format with unique identifiers.

Input format (from iscc-sum --similar):
    ISCC:K4AOMG... *file1.txt
      ~08 ISCC:K4AOMG... *file2.txt
      ~10 ISCC:K4AOMG... *file3.txt
    ISCC:K4AGSPO... *file4.txt

Output format (tab-separated, 7 columns, bidirectional):
    file_id  filename   iscc_code   match_id  match_filename  match_iscc_code  distance
    23       file1.txt  K4AOMG...   24        file2.txt       K4AOMG...        8
    24       file2.txt  K4AOMG...   23        file1.txt       K4AOMG...        8
    25       file4.txt  K4AGSPO...                                             -1
"""
import argparse


def clean_filename(filename):
    """Remove directory prefix from filename."""
    # Remove 'input_files/' prefix if present
    if filename.startswith('input_files/'):
        filename = filename[len('input_files/'):]

    return filename
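
On Python 3.9 or later, the same prefix stripping could be written with `str.removeprefix`. The sketch below is an alternative under that version assumption, not the script's own code; the name `strip_input_prefix` and the sample paths are made up for illustration.

```python
def strip_input_prefix(filename, prefix='input_files/'):
    """Drop a leading prefix if present; leave other names untouched."""
    # str.removeprefix (Python 3.9+) returns the string unchanged when it
    # does not start with the prefix, so no explicit startswith check is needed.
    return filename.removeprefix(prefix)


print(strip_input_prefix('input_files/23_image.tif'))  # prefix removed
print(strip_input_prefix('23_image.tif'))              # already clean, unchanged
```

Only a prefix at the very start of the string is removed, which matches the `startswith` check in `clean_filename`.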


def load_id_mapping(mapping_file):
    """Load filename to element_identifier mapping.

    Returns: dict mapping cleaned filename -> element_identifier
    """
    mapping = {}
    with open(mapping_file, 'r') as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) == 2:
                filename, element_id = parts
                # Clean the filename the same way as in parse
                cleaned = clean_filename(filename)
                mapping[cleaned] = element_id
    return mapping
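
To make the expected mapping file concrete, here is a minimal sketch that reads a two-column TSV into a dictionary with the standard `csv` module. It is an alternative illustration, not the script's implementation: the function name `read_mapping`, the sample filenames, and the element identifiers are all invented, and the mapping is read from an in-memory buffer rather than a real file.

```python
import csv
import io


def read_mapping(handle, prefix='input_files/'):
    """Build {cleaned filename: element identifier} from a two-column TSV."""
    mapping = {}
    # QUOTE_NONE keeps any quote characters in names as literal text.
    for row in csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # skip malformed or empty lines, as the loader above does
        filename, element_id = row
        if filename.startswith(prefix):
            filename = filename[len(prefix):]
        mapping[filename] = element_id
    return mapping


# Hypothetical mapping content: two valid rows and one malformed row.
sample = io.StringIO(
    "input_files/23_a.tif\tsample A\n"
    "input_files/24_b.tif\tsample B\n"
    "malformed line\n"
)
mapping = read_mapping(sample)
print(mapping)
```

Rows without exactly two tab-separated fields are dropped silently, so a missing entry only shows up later as a failed lookup.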


def parse_iscc_line(line):
    """Parse ISCC line and extract code and filename.

    Format: "ISCC:CODE *filename" or " ~NN ISCC:CODE *filename"
    Returns: (code, filename), or (None, None) if the ' *' separator is
    missing. The code is an empty string if no 'ISCC:' prefix is found.
    """
    # Find the * separator
    if ' *' not in line:
        return None, None

    # Split on ' *' to get code part and filename
    parts = line.split(' *', 1)
    code_part = parts[0].strip()
    filename = clean_filename(parts[1].strip())

    # Extract CODE (after 'ISCC:')
    if 'ISCC:' in code_part:
        code = code_part.split('ISCC:', 1)[1].strip()
    else:
        code = ''

    return code, filename
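
The two line shapes described in the docstring above can also be matched with a single regular expression. This is a sketch of an alternative approach, not the parser the script uses: the pattern `LINE_RE` and the function `parse_line` are hypothetical, and unlike `parse_iscc_line` this version returns the distance as well and does not strip the `input_files/` prefix.

```python
import re

# Hypothetical pattern: optional "~NN" distance, then "ISCC:<code>",
# then whitespace and "*<filename>".
LINE_RE = re.compile(
    r'^\s*(?:~(?P<dist>\d+)\s+)?ISCC:(?P<code>\S+)\s+\*(?P<name>.+)$'
)


def parse_line(line):
    """Return (distance, code, filename), or None if the line does not match.

    distance is None for reference lines, which carry no "~NN" field.
    """
    m = LINE_RE.match(line.rstrip())
    if m is None:
        return None
    dist = int(m.group('dist')) if m.group('dist') is not None else None
    return dist, m.group('code'), m.group('name')


print(parse_line('ISCC:K4AOMG *file1.txt'))       # reference line
print(parse_line('  ~08 ISCC:K4AOMG *file2.txt'))  # similar line
print(parse_line('not an iscc line'))              # no match
```

A regular expression rejects malformed lines in one step, while the split-based parser above is easier to read and tolerates a missing `ISCC:` prefix.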


def main():
    parser = argparse.ArgumentParser(
        description='Parse ISCC similarity output into tabular format'
    )
    parser.add_argument(
        'similarity_raw',
        help='Raw similarity output from iscc-sum --similar'
    )
    parser.add_argument(
        'id_mapping',
        help='TSV file mapping filenames to element identifiers'
    )
    parser.add_argument(
        'output_file',
        help='Tabular output file'
    )
    args = parser.parse_args()

    # Load ID mapping
    id_map = load_id_mapping(args.id_mapping)

    # Parse similarity output
    file_codes = {}  # filename -> code mapping
    matches = []  # List of (file1, code1, file2, code2, distance)
    current_ref = None
    current_code = None

    with open(args.similarity_raw, 'r') as f:
        for line in f:
            line = line.rstrip()
            if not line:
                continue

            if line.startswith('ISCC:'):
                # Reference file: "ISCC:CODE *filename"
                code, filename = parse_iscc_line(line)
                if code and filename:
                    current_ref = filename
                    current_code = code
                    file_codes[filename] = code

            elif line.startswith(' ') and current_ref:
                # Similar file: " ~NN ISCC:CODE *filename"
                parts = line.strip().split(None, 1)  # Split on first whitespace
                if len(parts) == 2:
                    dist_str = parts[0].replace('~', '')
                    distance = int(dist_str)

                    # Parse the rest of the line for ISCC and filename
                    code, filename = parse_iscc_line(parts[1])

                    if code and filename:
                        matches.append((current_ref, current_code, filename, code, distance))
                        file_codes[filename] = code
7d2c95a58897 planemo upload for repository https://github.com/BMCV/galaxy-image-analysis/tree/master/tools/iscc-sum commit 6db86b8b65a0e05b7f3541d505fbe900633fc72a
imgteam
parents:
diff changeset
124 # Write output with identifiers
7d2c95a58897 planemo upload for repository https://github.com/BMCV/galaxy-image-analysis/tree/master/tools/iscc-sum commit 6db86b8b65a0e05b7f3541d505fbe900633fc72a
imgteam
parents:
diff changeset
125 with open(args.output_file, 'w') as out:
7d2c95a58897 planemo upload for repository https://github.com/BMCV/galaxy-image-analysis/tree/master/tools/iscc-sum commit 6db86b8b65a0e05b7f3541d505fbe900633fc72a
imgteam
parents:
diff changeset
126 # Write header (7 columns)
7d2c95a58897 planemo upload for repository https://github.com/BMCV/galaxy-image-analysis/tree/master/tools/iscc-sum commit 6db86b8b65a0e05b7f3541d505fbe900633fc72a
imgteam
parents:
diff changeset
127 out.write("file_id\tfilename\tiscc_code\tmatch_id\tmatch_filename\tmatch_iscc_code\tdistance\n")
7d2c95a58897 planemo upload for repository https://github.com/BMCV/galaxy-image-analysis/tree/master/tools/iscc-sum commit 6db86b8b65a0e05b7f3541d505fbe900633fc72a
imgteam
parents:
diff changeset
128
7d2c95a58897 planemo upload for repository https://github.com/BMCV/galaxy-image-analysis/tree/master/tools/iscc-sum commit 6db86b8b65a0e05b7f3541d505fbe900633fc72a
imgteam
parents:
diff changeset
129 # Track which files have matches
7d2c95a58897 planemo upload for repository https://github.com/BMCV/galaxy-image-analysis/tree/master/tools/iscc-sum commit 6db86b8b65a0e05b7f3541d505fbe900633fc72a
imgteam
parents:
diff changeset
130 files_with_matches = set()
7d2c95a58897 planemo upload for repository https://github.com/BMCV/galaxy-image-analysis/tree/master/tools/iscc-sum commit 6db86b8b65a0e05b7f3541d505fbe900633fc72a
imgteam
parents:
diff changeset
131
7d2c95a58897 planemo upload for repository https://github.com/BMCV/galaxy-image-analysis/tree/master/tools/iscc-sum commit 6db86b8b65a0e05b7f3541d505fbe900633fc72a
imgteam
parents:
        # Write similarity matches in both directions
        for file1, code1, file2, code2, distance in matches:
            # Get element identifiers
            file1_name = id_map[file1]
            file2_name = id_map[file2]
            file1_id = file1.split('_', 1)[0]  # Extract ID from filename
            file2_id = file2.split('_', 1)[0]  # Extract ID from filename

            # Write A -> B (file_id is the numeric ID, filename is the element_identifier)
            out.write(f"{file1_id}\t{file1_name}\t{code1}\t{file2_id}\t{file2_name}\t{code2}\t{distance}\n")
            # Write B -> A (bidirectional)
            out.write(f"{file2_id}\t{file2_name}\t{code2}\t{file1_id}\t{file1_name}\t{code1}\t{distance}\n")

            files_with_matches.add(file1)
            files_with_matches.add(file2)

        # Write files with no matches (distance = -1, empty match columns)
        for filename in sorted(file_codes):
            if filename not in files_with_matches:
                file_id = filename.split('_', 1)[0]  # Extract ID from filename
                element_name = id_map[filename]
                code_val = file_codes[filename]
                out.write(f"{file_id}\t{element_name}\t{code_val}\t\t\t\t-1\n")


if __name__ == '__main__':
    main()
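
The two write loops above can be tried in isolation with a small sketch. This is an illustration only, not part of the script: `write_rows` is a hypothetical helper, the filenames, element identifiers and ISCC placeholder strings are made up, and it assumes that `matches`, `id_map` and `file_codes` have the shapes those loops consume.

```python
import io


def write_rows(out, matches, id_map, file_codes):
    """Hypothetical helper mirroring the two write loops of the script."""
    files_with_matches = set()
    for file1, code1, file2, code2, distance in matches:
        # The ID is the part of the filename before the first underscore.
        id1, id2 = file1.split('_', 1)[0], file2.split('_', 1)[0]
        name1, name2 = id_map[file1], id_map[file2]
        # One row per direction, so each file appears in the first column.
        out.write(f"{id1}\t{name1}\t{code1}\t{id2}\t{name2}\t{code2}\t{distance}\n")
        out.write(f"{id2}\t{name2}\t{code2}\t{id1}\t{name1}\t{code1}\t{distance}\n")
        files_with_matches.update((file1, file2))
    # Files without any match: empty match columns and distance -1.
    for filename in sorted(file_codes):
        if filename not in files_with_matches:
            file_id = filename.split('_', 1)[0]
            out.write(f"{file_id}\t{id_map[filename]}\t{file_codes[filename]}\t\t\t\t-1\n")


# Made-up sample data (assumed shapes, not real ISCC codes).
file_codes = {"0_a.txt": "ISCC:AAA", "1_b.txt": "ISCC:BBB", "2_c.txt": "ISCC:CCC"}
id_map = {"0_a.txt": "sample A", "1_b.txt": "sample B", "2_c.txt": "sample C"}
matches = [("0_a.txt", "ISCC:AAA", "1_b.txt", "ISCC:BBB", 8)]

buffer = io.StringIO()
write_rows(buffer, matches, id_map, file_codes)
rows = [line.split('\t') for line in buffer.getvalue().splitlines()]
for row in rows:
    print(row)
```

Every row has seven tab-separated columns, whether or not the file has a match, so a downstream consumer can rely on a fixed width and treat a distance of -1 as the "no match" sentinel.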