annotate scripts/README.txt @ 0:3b33da018e74 draft default tip

Imported from capsule None
author devteam
date Mon, 19 May 2014 12:33:42 -0400
parents
children
This file explains how to create annotation indexes for the annotation profiler tool. Annotation profiler indexes are a very simple binary format,
containing no header information and consisting of an ordered linear list of (start, stop) regions, with each value encoded individually as '<I' (in Python struct notation, a little-endian unsigned 32-bit integer). The regions are those covered by a UCSC table, partitioned
by chromosome name. Genomic regions are merged when they overlap or are directly adjacent (e.g. a table having ranges of 1-10, 6-12, 12-20 and 25-28 results in two merged ranges: 1-20 and 25-28).
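
The merging rule and the '<I' encoding can be illustrated with a short sketch. This is not the code from build_profile_indexes.py; the function names merge_regions and pack_regions are hypothetical, and "directly adjacent" is assumed to mean that one range starts exactly where the previous one stops, as in the 6-12, 12-20 example.

```python
import struct


def merge_regions(regions):
    """Merge overlapping or directly adjacent (start, stop) ranges.

    Assumption: a range is adjacent when its start equals the previous stop.
    """
    merged = []
    for start, stop in sorted(regions):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(region) for region in merged]


def pack_regions(regions):
    """Encode each start and stop individually as '<I', with no header."""
    return b"".join(
        struct.pack("<I", value) for region in regions for value in region
    )


table_ranges = [(1, 10), (6, 12), (12, 20), (25, 28)]
merged = merge_regions(table_ranges)   # [(1, 20), (25, 28)]
payload = pack_regions(merged)         # 16 bytes: four 4-byte integers
```

Each value occupies 4 bytes, so a file holding N regions is 8 * N bytes long.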

Files are arranged like:
/profiled_annotations/DBKEY/TABLE_NAME/
    CHROMOSOME_NAME.covered
    CHROMOSOME_NAME.total_coverage
    CHROMOSOME_NAME.total_regions
/profiled_annotations/DBKEY/
    DBKEY_tables.xml
    chromosomes.txt
    profiled_info.txt

where CHROMOSOME_NAME.covered is the binary file, CHROMOSOME_NAME.total_coverage is a text file containing the integer count of bases covered by the
table, and CHROMOSOME_NAME.total_regions contains the integer count of regions found in CHROMOSOME_NAME.covered.
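
As a rough illustration of how the three per-chromosome files relate, the sketch below decodes the contents of a .covered file and recomputes the two counts. The function names are hypothetical, and the coverage count assumes that the length of a region is stop minus start; neither is taken from the profiler's own code.

```python
import struct


def parse_covered(data):
    """Decode the raw bytes of a .covered file into (start, stop) tuples."""
    if len(data) % 8:
        raise ValueError("expected a multiple of 8 bytes (two '<I' values per region)")
    values = struct.unpack("<%dI" % (len(data) // 4), data)
    # Values alternate start, stop, start, stop, ...
    return list(zip(values[0::2], values[1::2]))


def read_covered(path):
    """Read a CHROMOSOME_NAME.covered file from disk."""
    with open(path, "rb") as handle:
        return parse_covered(handle.read())


def summarize(regions):
    """Return (total_regions, total_coverage) for a list of regions.

    Assumption: each region covers stop - start bases.
    """
    return len(regions), sum(stop - start for start, stop in regions)
```

Under those assumptions, the two numbers returned by summarize() should match the integers stored in CHROMOSOME_NAME.total_regions and CHROMOSOME_NAME.total_coverage.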

DBKEY_tables.xml should be appended to the annotation profiler's available tables configuration file (tool-data/annotation_profiler_options.xml).
The DBKEY should also be added as a new line to the annotation profiler valid builds file (annotation_profiler_valid_builds.txt).
The output directory (/profiled_annotations/DBKEY) should be made available as GALAXY_ROOT/tool-data/annotation_profiler/DBKEY.
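
Two of these installation steps can be scripted. The sketch below is only one possible approach: register_build is a hypothetical helper, the location of the valid builds file is passed in because this README does not state it, and a symbolic link is assumed to be an acceptable way of making the output available (copying the directory would work as well). Appending DBKEY_tables.xml to the XML configuration file is left as a manual step.

```python
import os


def register_build(dbkey, output_dir, galaxy_root, valid_builds_path):
    """Add dbkey to the valid builds file and link the output into tool-data.

    Returns the path of the link under tool-data/annotation_profiler.
    """
    text = ""
    if os.path.exists(valid_builds_path):
        with open(valid_builds_path) as handle:
            text = handle.read()
    # Only add the dbkey if it is not already listed on a line of its own.
    if dbkey not in [line.strip() for line in text.splitlines()]:
        with open(valid_builds_path, "a") as handle:
            if text and not text.endswith("\n"):
                handle.write("\n")
            handle.write(dbkey + "\n")

    # Assumption: a symlink is used to expose /profiled_annotations/DBKEY.
    target_dir = os.path.join(galaxy_root, "tool-data", "annotation_profiler")
    os.makedirs(target_dir, exist_ok=True)
    link = os.path.join(target_dir, dbkey)
    if not os.path.lexists(link):
        os.symlink(os.path.abspath(output_dir), link)
    return link
```

Calling the helper a second time with the same arguments changes nothing, so it is safe to re-run after rebuilding the indexes.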

profiled_info.txt contains information about the generated annotations, with one tab-delimited label, value pair per line:
    profiler_version - the version of the build_profile_indexes.py script that was used to generate the profiled data
    dbkey - the dbkey used for the run
    chromosomes - the names and lengths of the chromosomes that were used to parse single-chromosome tables (tables divided into individual files by chromosome)
    dump_time - the declared dump time of the database, taken from trackDb.txt.gz
    profiled_time - the time at which the database dump was profiled, in seconds since the epoch (UTC)
    database_hash - an MD5 hex digest of all the profiled table info
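
A file in this layout can be read with a few lines of Python. The sketch below is a minimal reader, not part of the profiler; parse_profiled_info is a hypothetical name, and because the exact layout of the chromosomes value is not described here, every value is kept as a raw string.

```python
def parse_profiled_info(lines):
    """Parse tab-delimited 'label<TAB>value' lines into a dictionary."""
    info = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the first tab only, so values may themselves contain tabs.
        label, _, value = line.partition("\t")
        info[label] = value
    return info


def read_profiled_info(path):
    """Read profiled_info.txt from disk."""
    with open(path) as handle:
        return parse_profiled_info(handle)
```

For example, int(info["profiled_time"]) gives the profiling time as seconds since the epoch, which can then be passed to datetime.datetime.fromtimestamp with a UTC timezone.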


Typical usage includes:

python build_profile_indexes.py -d hg19 -i /ucsc_data/hg19/database/ > hg19.txt

where the genome build is hg19 and /ucsc_data/hg19/database/ contains the downloaded database dump from UCSC (e.g. obtained by rsync: rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ /ucsc_data/hg19/database/).


By default, chromosome names come from a file named 'chromInfo.txt.gz' found in the input directory, with FTP used as a fallback.
When FTP is used to obtain the chromosome names from UCSC for a particular genome build, alternate FTP sites and paths can be specified with the --ftp_site and --ftp_path options.
Chromosome names can instead be provided on the command line via the --chromosomes option, which accepts a comma-separated list of the form: ChromName1[=length],ChromName2[=length],...
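
The ChromName[=length] syntax can be unpacked as sketched below. This is an illustration of the format only, not the parsing code in build_profile_indexes.py: parse_chromosomes is a hypothetical name, and falling back to a default size when no length is given is an assumption based on the help text of the --bitset_size option.

```python
def parse_chromosomes(option_value, default_size):
    """Turn 'chr1=1000,chrM' into {'chr1': 1000, 'chrM': default_size}."""
    chromosomes = {}
    for field in option_value.split(","):
        field = field.strip()
        if not field:
            continue
        name, separator, length = field.partition("=")
        # Assumption: a missing '=length' falls back to the default size.
        chromosomes[name] = int(length) if separator else default_size
    return chromosomes
```

An empty option value yields an empty dictionary, which a caller could treat as the signal to fall back to chromInfo.txt.gz or FTP.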



import os
from optparse import OptionParser

# DEFAULT_BITSET_SIZE is assumed to be defined elsewhere in the script.
usage = "usage: %prog options"
parser = OptionParser( usage=usage )
parser.add_option( '-d', '--dbkey', dest='dbkey', default='hg18', help='dbkey to process' )
parser.add_option( '-i', '--input_dir', dest='input_dir', default=os.path.join( 'golden_path','%s', 'database' ), help='Input Directory' )
parser.add_option( '-o', '--output_dir', dest='output_dir', default=os.path.join( 'profiled_annotations','%s' ), help='Output Directory' )
parser.add_option( '-c', '--chromosomes', dest='chromosomes', default='', help='Comma separated list of: ChromName1[=length],ChromName2[=length],...' )
parser.add_option( '-b', '--bitset_size', dest='bitset_size', default=DEFAULT_BITSET_SIZE, type='int', help='Default BitSet size; overridden by sizes specified in chromInfo.txt.gz or by --chromosomes' )
parser.add_option( '-f', '--ftp_site', dest='ftp_site', default='hgdownload.cse.ucsc.edu', help='FTP site; used for chromosome info when chromInfo.txt.gz method fails' )
parser.add_option( '-p', '--ftp_path', dest='ftp_path', default='/goldenPath/%s/chromosomes/', help='FTP Path; used for chromosome info when chromInfo.txt.gz method fails' )