comparison ARTS/README @ 0:3723b54935cb draft

Uploaded
author mmaiensc
date Wed, 13 Nov 2013 16:13:17 -0500
1 ARTS: Automated Randomization of multiple Traits for Study design
2 Written by Mark Maienschein-Cline
3 mmaiensc@gmail.com
4 Center for Research Informatics
5 University of Illinois at Chicago
6
7 ARTS uses a genetic algorithm to optimize (minimize) a mutual information-based objective function, obtaining
8 an optimal randomization for studies of arbitrary size and design.
9
10 The publication for this code is in preparation; citation to be added soon (hopefully!). When it is published,
11 a section of the supplementary information will give more details about usage (in addition to what's below).
12
13 Please contact me at the email above with questions.
14
15
16
17 There are two ways of using this code: from the command line (it's a Perl script), or through Galaxy.
18
19 You can learn about, and download, Galaxy at http://galaxyproject.org.
20
21 ################
22 # INSTALLATION #
23 ################
24
25 #
26 # Command line version:
27 #
28 No installation needed, as long as you have a Perl interpreter. It should work fine on a Mac or Linux system.
29 It will probably work on Windows too, but I haven't tested it there.
30
31 #
32 # Galaxy version:
33 #
34 Two options:
35 1) You can download this tool from the Galaxy toolshed directly into your installation.
36 2) Move the ARTS.pl and .xml files into tools/ in your Galaxy distribution, and edit the tool configuration file
37 (typically tool_conf.xml) appropriately. If you don't know how to do this, you should probably use strategy #1.
38
39 ###########
40 # RUNNING #
41 ###########
42
43 #
44 # Galaxy version
45 #
46 Once you get the tools installed in Galaxy, there are help sections in the tool descriptions you can refer to.
47 Also refer to the instructions for the command-line version below.
48
49 #
50 # Command line version:
51 #
52
53 Run ARTS.pl without any inputs to see the usage. All inputs are specified using the usual [-flag] [value]
54 syntax (e.g., -i input.txt).
55
56 Sample command using the sample_data.txt file:
57 ./ARTS.pl -i sample_data.txt -c "2,3,4,5;2;3;4;5" -b 10 -o batched_data.txt -cc 2,4 -cd 4
58
59
60 More information about the inputs (*'ed remarks refer to the values in the sample command above):
61
62 -i Input trait table: tab-delimited table, including 1 header line. See sample_data.txt for an example.
63 You can prepare this table in Excel and save as a tab-delimited text, or just write it in a text file,
64 or copy-paste from Excel to a text file. You can have more columns than you will actually care about
65 randomizing here.
66 * You can use the file sample_data.txt as an example input; there are 5 columns, Sample ID, Age, Sex,
67 Collection Date, and Disease.
68
69 -c Trait columns to randomize. This is a comma- and semicolon-delimited list. Its syntax is important,
70 so pay attention.
71 Columns are numbered starting from 1. Traits that should be considered jointly should be listed together
72 separated by commas. Each set of jointly considered traits should be listed separated by semicolons. Hence,
73 * -c "2,3,4,5;2;3;4;5" says to consider all the traits (columns 2-5) jointly (that's the 2,3,4,5 part), AND
74 to consider each trait individually (that's the ;2;3;4;5 part).
75 You could opt to only consider traits individually (-c "2;3;4;5"), or only jointly (-c "2,3,4,5"), or only
76 pair-wise (-c "2,3;2,4;2,5;3,4;3,5;4,5"), or whatever you want.
77 OUR GENERAL-PURPOSE RECOMMENDATION is to consider all traits jointly, plus all individually, as in the sample
78 command. This corresponds to the MMI statistic discussed in the publication.
79 GALAXY USERS: you just get to select the columns to consider, and the script will use the MMI statistic
80 automatically (you don't get a choice).
81 FINAL NOTE: you should put quotes around the value here, since otherwise the shell will interpret the semicolons
82 as command separators and cut the command short.
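
If the -c syntax is hard to keep straight, here is a rough Python sketch of how such a string breaks down into trait groups, plus two helpers that build the recommended "all joint + all individual" string and the pair-wise string. This is not ARTS's own parser (ARTS is a Perl script); the function names parse_trait_spec, mmi_spec, and pairwise_spec are made up for illustration.

```python
from itertools import combinations


def parse_trait_spec(spec):
    """Split a -c style string into a list of column-index groups.

    Semicolons separate groups; commas separate the columns inside a group.
    """
    return [[int(col) for col in group.split(",")] for group in spec.split(";")]


def mmi_spec(columns):
    """Build the 'all columns jointly, plus each column on its own' string."""
    joint = ",".join(str(c) for c in columns)
    singles = ";".join(str(c) for c in columns)
    return joint + ";" + singles


def pairwise_spec(columns):
    """Build the 'every pair of columns' string."""
    return ";".join(f"{a},{b}" for a, b in combinations(columns, 2))


print(parse_trait_spec("2,3,4,5;2;3;4;5"))
print(mmi_spec([2, 3, 4, 5]))
print(pairwise_spec([2, 3, 4, 5]))
```

The last two helpers reproduce the strings used in the examples above, so you can generate a -c value for a different set of columns instead of typing it by hand.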
83
84 -b Batch size (number of samples that can be processed at the same time). You have two options:
85 1) Enter a single number. This will fill as many complete batches as possible, and put the remainder into a smaller
86 batch. This is probably convenient, but you should do a quick count to make sure you don't end up with a really
87 small last batch (e.g., if you have 105 samples and do batch size of 25, your last batch will only have 5 samples).
88 2) Enter a comma-delimited list that adds up to the number of samples, which allows for uneven batch sizes.
89 For example, -b 10,10,9,9 for 38 samples. If the sizes don't add up to the number of samples, the program will exit and let you know.
90 * sample_data.txt has 30 samples, so "-b 10" makes 3 batches of 10 samples each.
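
To do the "quick count" before running, something like the following Python sketch mirrors the two -b options described above. It is only an illustration of the rules as stated in this README, not code taken from ARTS; expand_batch_sizes is a hypothetical name.

```python
def expand_batch_sizes(spec, n_samples):
    """Turn a -b style value into an explicit list of batch sizes.

    A single number fills as many complete batches as possible and puts the
    remainder into one smaller batch. A comma-delimited list is used as-is,
    but must add up to the number of samples.
    """
    sizes = [int(part) for part in spec.split(",")]
    if any(size <= 0 for size in sizes):
        raise ValueError("batch sizes must be positive")
    if len(sizes) == 1:
        full_batches, remainder = divmod(n_samples, sizes[0])
        return [sizes[0]] * full_batches + ([remainder] if remainder else [])
    if sum(sizes) != n_samples:
        raise ValueError(
            f"batch sizes add up to {sum(sizes)}, but there are {n_samples} samples"
        )
    return sizes


print(expand_batch_sizes("10", 30))          # the sample command
print(expand_batch_sizes("25", 105))         # ends with a small last batch
print(expand_batch_sizes("10,10,9,9", 38))   # uneven batches
```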
91
92 -o Output file. The batch assignments are added as an extra column at the end; otherwise the file looks
93 like the input.
94 * batched_data.txt is our output file.
95
96 -p (sort-of optional: you MUST use both -b and -o, OR just -p) The column that already holds batch assignments. ARTS
97 prints (to STDOUT) the statistics for that batching. The result looks like the last part of the STDOUT from an ARTS run
98 (see below). Use this option to test batch assignments from another algorithm, or ones you made by hand.
99
100 -cc Indices of continuously-valued columns, as a comma-delimited list. ARTS uses discrete values for its statistics, so
101 these columns must be discretized (binned). If ARTS encounters a column with more than 20 distinct values, it will print
102 a warning asking whether you want it to be treated as continuous.
103 * In sample_data.txt, columns 2 (age) and 4 (date) could be considered continuous (that is, it's worth treating
104 a 35 year-old similarly to a 36 year-old), so we set "-cc 2,4".
105
106 -cd Date-valued columns. These columns should also be listed under -cc, but this lets ARTS know to expect a date
107 and convert it to a number before binning. The format MUST be M/D/Y, with the month given as a number
108 (1 instead of January).
109 * In sample_data.txt, column 4 is a date, so set "-cd 4".
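
For a feel of what "convert the date to a number" can mean, here is a minimal Python sketch that turns an M/D/Y string into a day count. How ARTS does the conversion internally is not described in this README, so treat the day-count choice (and the name date_to_number) as an assumption.

```python
from datetime import datetime


def date_to_number(text):
    """Convert an M/D/Y string (month as a number) to a day count.

    Assumption: a plain day count is used as the numeric value. Months and
    days may be written with or without a leading zero.
    """
    return datetime.strptime(text, "%m/%d/%Y").toordinal()


first = date_to_number("1/15/2010")   # earliest date in the sample output
last = date_to_number("7/27/2013")    # latest date in the sample output
print(last - first)                   # span of the date column in days
```

A day count is at least consistent with the sample output further down: splitting the span from 1/15/2010 to 7/27/2013 into 5 equal parts puts the first boundary on 9/29/2010, which is the boundary shown there.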
110
111 -cb Number of bins to use for discretizing the continuous columns. Again, you can set a single value, or give a
112 comma-delimited list whose entries match the order of the list given in the -cc flag.
113 * For the sample run, we left the default value of 5, but we could do, for example, "-cb 5,7", which would bin
114 the ages into 5 bins and the dates into 7 bins (since we set "-cc 2,4", and column 2 was age, column 4 was date).
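
The age ranges in the sample output below (19-27.2, 27.2-35.4, and so on) are what you get from equal-width bins between the smallest and largest value. The Python sketch below reproduces them for the Age column of the sample data. It is an illustration only: bin_edges and bin_index are hypothetical names, and which bin receives a value that falls exactly on a boundary is my assumption.

```python
def bin_edges(values, n_bins):
    """Return the n_bins + 1 edges of equal-width bins spanning the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(value, lo, hi, n_bins):
    """Return the 0-based bin that a value falls into."""
    if hi == lo:
        return 0  # all values identical: everything goes in one bin
    width = (hi - lo) / n_bins
    # The largest value would land one past the end, so clamp it to the last bin.
    return min(int((value - lo) / width), n_bins - 1)


# Age column of the 30 samples shown in the output file below.
ages = [25, 37, 36, 52, 48, 60, 31, 28, 26, 44, 33, 25, 28, 30, 51,
        22, 28, 19, 35, 38, 25, 55, 33, 24, 42, 60, 34, 37, 41, 50]

edges = bin_edges(ages, 5)
counts = [0] * 5
for age in ages:
    counts[bin_index(age, min(ages), max(ages), 5)] += 1

print([round(edge, 1) for edge in edges])
print(counts)
```

The counts match the "Total" row of the sample output for the five age ranges once the ranges are put in ascending order.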
115
116 -bn Name for the batch column added to the output. Default is "batch".
117
118 -s Random number seed. Set as a large negative integer. By default the code always uses the same seed; use this
119 option if you want to rerun with a different one.
120
121 ----------------------------------------------
122
123 When you run the sample command, the STDOUT looks like this (I added the "N)" line numbers):
124
125 """""""""""""""""""
126 1) Using traits: Age Sex Collection date Disease
127 2) Using trait combinations: {Age,Sex,Collection date,Disease} {Age} {Sex} {Collection date} {Disease}
128 3) Generation 1 of 300, average fitness 0.1432
129 4) Generation 2 of 300, average fitness 0.1342
130 5) Generation 3 of 300, average fitness 0.1298
131 6) Generation 4 of 300, average fitness 0.1279
132 7) Generation 5 of 300, average fitness 0.1250
133 8) Generation 6 of 300, average fitness 0.1227
134 9) Generation 7 of 300, average fitness 0.1211
135 10) Generation 8 of 300, average fitness 0.1194
136 11) Generation 9 of 300, average fitness 0.1187
137 12) Generation 10 of 300, average fitness 0.1181
138 13) Generation 11 of 300, average fitness 0.1175
139 14) Generation 12 of 300, average fitness 0.1165
140 15) Generation 13 of 300, average fitness 0.1143
141 16) Generation 14 of 300, average fitness 0.1133
142 17) Generation 15 of 300, average fitness 0.1132
143 18) Generation 16 of 300, average fitness 0.1127
144 19) Generation 17 of 300, average fitness 0.1123
145 20) Generation 18 of 300, average fitness 0.1116
146 21) Generation 19 of 300, average fitness 0.1119
147 22) Generation 20 of 300, average fitness 0.1113
148 23) Generation 21 of 300, average fitness 0.1113
149 24) Generation 22 of 300, average fitness 0.1110
150 25) Generation 23 of 300, average fitness 0.1110
151 26) Final MI 0.1045 ; Individual trait MIs (mean 0.0091 ): 0.0155 0.0000 0.0209 0.0000
152 27) -----------------------------------------------------------------
153 28) Age values Sex values Collection date values Disease values
154 29) Batch (size) 19-27.2 35.4-43.6 51.8-60 43.6-51.8 27.2-35.4 M F 2/26/2012-11/11/2012 11/11/2012-7/27/2013 6/14/2011-2/26/2012 9/29/2010-6/14/2011 1/15/2010-9/29/2010 Y N
155 30) ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
156 31) 1 (10) 2 2 2 1 3 5 5 3 2 2 2 1 5 5
157 32) 2 (10) 2 2 1 2 3 5 5 2 2 4 1 1 5 5
158 33) 3 (10) 3 2 1 1 3 5 5 3 2 2 2 1 5 5
159 34) ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
160 35) Total 7 6 4 4 9 15 15 8 6 8 5 3 15 15
161 """""""""""""""""""
162
163 Here's what the lines mean:
164 1) Tells you what traits you've selected.
165 2) Tells you what trait combinations you've selected.
166 3-25) Prints the progress for each generation of the GA. The run converges (stops) when average fitness changes by less than 0.0001.
167 26) Final objective function value. Normalized between 0 and 1, ideal case is 0. Note that different choices of the
168 objective function ARE NOT COMPARABLE: if you select fewer traits, or simpler combinations of traits (fewer
169 joint traits) using different -c values, you will get lower MI values, but this does not necessarily indicate better
170 overall randomization, because your choices may be overly simplistic. This is why we recommend sticking with the
171 MMI definition (all joint + all individual) consistently. This line also gives the MI value for each
172 individual trait, along with their mean.
173 27-34) Individual trait counts per batch for each trait value. For continuously-valued columns, each bin is given as a range
174 (e.g., age 19-27.2).
175 35) Total number of samples in each bin (trait value) over all batches.
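
To make the individual-trait MI values on line 26 a bit more concrete, here is a minimal Python sketch of the plain mutual information between batch labels and one discrete trait. It is not ARTS's objective function: the normalization to the 0-1 range and the way joint and individual terms are combined are not described in this README, so the sketch leaves both out. The function name mutual_information is hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(batches, values):
    """Plain (unnormalized) mutual information, in nats, between two label lists."""
    n = len(batches)
    joint = Counter(zip(batches, values))
    batch_counts = Counter(batches)
    value_counts = Counter(values)
    mi = 0.0
    for (batch, value), count in joint.items():
        # p(b, v) * log( p(b, v) / (p(b) * p(v)) ), written with raw counts
        mi += (count / n) * log(count * n / (batch_counts[batch] * value_counts[value]))
    return mi


# Sex and batch columns of the 30 samples in the output file shown below.
sex = list("MFFMMMFFMFMFMFMMMFMFFMFMMMFFFF")
batch = [3, 3, 1, 1, 3, 3, 3, 2, 1, 1, 3, 3, 2, 3, 2,
         2, 1, 3, 1, 2, 1, 2, 1, 2, 3, 1, 2, 1, 2, 2]

print(mutual_information(batch, sex))                          # balanced
print(mutual_information([1, 1, 2, 2], ["M", "M", "F", "F"]))  # fully confounded
```

Every batch in the sample run holds 5 M and 5 F, so batch tells you nothing about sex and the MI is zero, in line with the 0.0000 reported for that trait. The second call shows the opposite extreme, where batch determines sex completely.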
176
177 ----------------------------------------------
178
179 The output, batched_data.txt, will look like this:
180
181 """""""""""""""""""
182 Sample ID Age Sex Collection date Disease batch
183 sample1 25 M 3/28/2012 Y 3
184 sample2 37 F 4/27/2013 N 3
185 sample3 36 F 3/10/2013 N 1
186 sample4 52 M 7/1/2012 Y 1
187 sample5 48 M 8/13/2011 Y 3
188 sample6 60 M 9/21/2011 N 3
189 sample7 31 F 10/22/2010 Y 3
190 sample8 28 F 1/15/2010 N 2
191 sample9 26 M 1/7/2012 N 1
192 sample10 44 F 4/5/2012 Y 1
193 sample11 33 M 5/18/2012 N 3
194 sample12 25 F 7/27/2013 N 3
195 sample13 28 M 1/20/2013 Y 2
196 sample14 30 F 8/11/2012 Y 3
197 sample15 51 M 11/23/2011 N 2
198 sample16 22 M 12/21/2011 N 2
199 sample17 28 M 9/26/2010 Y 1
200 sample18 19 F 1/18/2010 Y 3
201 sample19 35 M 2/10/2012 N 1
202 sample20 38 F 2/17/2012 N 2
203 sample21 25 F 4/28/2012 Y 1
204 sample22 55 M 1/7/2013 Y 2
205 sample23 33 F 6/30/2013 N 1
206 sample24 24 M 7/1/2012 Y 2
207 sample25 42 M 2/15/2011 N 3
208 sample26 60 M 5/21/2011 N 1
209 sample27 34 F 10/23/2010 Y 2
210 sample28 37 F 12/18/2010 Y 1
211 sample29 41 F 11/7/2012 N 2
212 sample30 50 F 2/15/2012 Y 2
213 """""""""""""""""""
214
215 The output looks the same as the input file, with a sixth column titled "batch" added, saying which of the three
216 batches each sample should be processed in (of course, you can permute the order of batches if you want).
217
218 The included file batched_data.txt shows what the output should look like.
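
If you want to double-check an output file yourself, the Python sketch below tallies how many samples with each trait value ended up in each batch, which is the same kind of count shown in the table part of the STDOUT. It assumes the file is tab-delimited with one header line, as described for the input; tally_by_batch is a hypothetical helper, and only the first five rows of the output above are used to keep the example short.

```python
import csv
import io
from collections import Counter


def tally_by_batch(handle, trait, batch_column="batch"):
    """Count (batch, trait value) pairs in a tab-delimited table with a header."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter((row[batch_column], row[trait]) for row in reader)


# First five rows of the output shown above, standing in for the real file.
sample = io.StringIO(
    "Sample ID\tAge\tSex\tCollection date\tDisease\tbatch\n"
    "sample1\t25\tM\t3/28/2012\tY\t3\n"
    "sample2\t37\tF\t4/27/2013\tN\t3\n"
    "sample3\t36\tF\t3/10/2013\tN\t1\n"
    "sample4\t52\tM\t7/1/2012\tY\t1\n"
    "sample5\t48\tM\t8/13/2011\tY\t3\n"
)

counts = tally_by_batch(sample, "Sex")
for (batch, value), n in sorted(counts.items()):
    print(batch, value, n)
```

To run it on the real file, pass open("batched_data.txt", newline="") instead of the in-memory sample. If you changed the column name with -bn, pass that name as batch_column.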