28 print "-reads1" + "\t\t" + "The number of reads to be used to calculate the correction factor for time 0." + "\n\t\t" + "(default counted from bowtie output)" + "\n"
29 print "-reads2" + "\t\t" + "The number of reads to be used to calculate the correction factor for time 6." + "\n\t\t" + "(default counted from bowtie output)" + "\n"
30 print "-cutoff" + "\t\t" + "Discard any positions where the average of counted transcripts at time 0 and time 1 is below this number (default 0)" + "\n"
31 print "-cutoff2" + "\t\t" + "Discard any positions within the normalization genes where the average of counted transcripts at time 0 and time 1 is below this number (default 0)" + "\n"
35 print "-multiply" + "\t" + "Multiply all fitness scores by a certain value (e.g., the fitness of a knockout). You should normalize the data." + "\n"
64 # The reference genome is required to find what gene each insertion falls within, the mapfiles are required because the relative number of insertions between time 2 and time 1 are used to calculate each insertion location's fitness, and the outfile is required so you... get an outfile.
70 # Sets the default value of the expansion factor to 250, which is not a particularly meaningful number; the expansion factor just needs to have some value so that rough fitness calculations can be run.
71 # Measuring and inputting your own expansion factor is highly recommended, as without it the fitness calculations will not be nearly as accurate. See the equations used to calculate fitness if you're confused about that.
81 # Sets the default value of cutoff to 0; cutoff exists to discard positions with a low number of counted transcripts, because fitnesses calculated from them may not be very accurate, by the same reasoning that studies with low sample sizes are innacurate.
86 # Sets the default value of cutoff2 to 10; cutoff2 exists to discard positions within normalization genes with a low number of counted transcripts, because fitnesses calculated from them similarly may not be very accurate.
109 # Uses Biopython's SeqIO to parse the reference genome file for features; these come from its feature table, which lists all known features and their relevance - if they're a gene, a recognizable repeat, etc.
120 # Makes a dictionary out of each feature that's a gene - with its gene name, start location, end location, and strand as keys to their values. Then makes a list out of all those dictionaries for ease of accessing later on.
130 # Exclude_first and exclude_last are used here to exclude whatever percentage of the genes you like from calculations; e.g. a value of 0.1 for exclude_last would exclude the last 10% of all genes!
158 # When called, goes through each line of the mapfile (each line is a compressed read) to find the strand, count, and position of the read. It may be helpful to look at how the mapfiles are formatted to understand how this code finds them.
161 # The position is the position of the sequence when alligned to the reference genome (found by bowtie, also before calc_fitness.py is called). This is the position of the FIRST nucleotide in the sequence, when mapped.
162 # Also records the total numbers of reads for the plus and minus strands, and the number of plus and minus sites from the mapfiles - not for fitness calculations, just for reference, as they're printed later.
177 # If for some reason you want to skip all reads from one of the strands - for example, if you wanted to compare the two strands - that's done here.
183 # Here we run into a tricky problem, in that the actual location of an insertion would be right before the position of the first nucleotide in a flanking sequence "following" the insertion, but right after the last nucleotide in a flanking sequence "preceding" the insertion.
184 # Thus the location of an insertion from a following flanking sequence would be the same as the mapped position of the sequence, but the location of an insertion from a preceding flanking sequence would be its mapped position + the length of the flanking sequence - 2.
185 # The -2 comes from a fake "TA" in the read, caused by MmeI cutting unevenly at an "AT" site so that an extra "TA" is created, artifically adding +2 to the flanking sequence length.
186 # However, it is currently impossible to distinguish between following and preceding flanking sequences, and so we split the difference by adding (the length of the sequence - 2) to only the + strand reads.
187 # In the future we will try not adding anything to both, or adding 1/2(the sequence length -2) to both + an - strands, to see if that improves calculations.
188 # The 'if arguments.downstream' option is there for a different method than the standard, in which reads are only taken from downstream of the transposon.
221 # Prints the number of + and - reads from reads1 and reads2, as well as the number of + and - sites found from the mapfiles, for reference when debugging etc.
249 # Calculates the correction factors for reads from t1 and t2; cfactor1 and cfactor2 are the number of reads from t1 and t2 respectively divided by total, which is the average number of reads between the two.
251 # This is used later on to correct for pipetting errors, or any other error that would cause unequal amounts of DNA from t1 and t2 to be sequenced so that an unequal amount of reads is produced
252 # However, because slightly more or less DNA shouldn't change the % frequency of a mutation (it would only change the absolute number of counts) this may not be necessary; in the future we'll try removing this.
270 # At each possible location for an insertion in the genome, counts the number of actual insertions at t1 and which strand(s) the corresponding reads came from.
283 # If there were no insertions at a certain location at t1 just continues to the next location; there can't be any comparison to make between t1 and t2 if there are no t1 insertions!
289 # At each location where there was an insertion at t1, counts the number of insertions at t2 and which strand(s) the corresponding reads came from.
303 # Corrects with cfactor1 and cfactor2; as noted above, this may not be necessary and we'll see what removing it does in the future. Possible alternate / non-corrected code is shown in the comments just below.
324 # Calculates each insertion's frequency within the populations at t1 and t2. If cfactor correction were removed, "total" would have to be changed to "float(arguments.reads1)" and "float(arguments.reads2)" respectively, as shown in the comments just below.
333 # Calculates each insertion's fitness! This is from the fitness equation log((frequency of mutation @ time 2 / frequency of mutation @ time 1)*expansion factor)/log((frequency of population without the mutation @ time 2 / frequency of population without the mutation @ time 1)*expansion factor)
351 # Writes all relevant information on each insertion and its fitness to a cvs file: the location of the insertion, its strand, c1, c2, etc. (the variable names are self-explanatiory)
361 # Prints the time when done comparing mapfiles - that is, when done with all fitness calculations but normalization - as well as the number of genic inserts and number of total inserts.
378 # If making a WIG file is requested in the arguments, starts a string to be added to and then written to the WIG file with a typical WIG file header.
384 # If a file's given for normalization, starts normalization; this corrects for anything that would cause all the fitness values to be too high or too low.
388 # Makes a list of the genes in the normalization file, which should all be transposon genes (these naturally ought to have a fitness value of exactly 1, because transposons are generally non-coding DNA)
400 # Finds all insertions within one of the normalization genes that also have a w value; gets their c1 and c2 values (the number of insertions at t1 and t2) and takes the average of that!
401 # The average is later used as the "weight" of an insertion location's fitness - if it's had more insertions, it should weigh proportionally more towards the average fitness of insertions within the normalization genes.
409 # Skips over those insertion locations with too few insertions - their fitness values are less accurate because they're based on such small insertion numbers.
419 # Tallies how many w values are 0 within the blank_ws value; you might get many transposon genes with a w value of 0 if a bottleneck occurs, which is especially common with in vivo experiments.
420 # For example, when studying a nasal infection in a mouse model, what bacteria "sticks" and is able to survive and what bacteria is swallowed and killed or otherwise flushed out tends to be a matter
426 # Adds the fitness values of the insertions within normalization genes together and increments count so their average fitness (sum/count) can be calculated later on
431 # Records the weights of the fitness values of the insertion locations in corresponding lists - for example, weights[2] would be the weight of the fitness value at score[2]
438 # Counts and removes all "blank" fitness values of normalization genes - those that = 0 - because they most likely don't really have a fitness value of 0, and you just happened to not get any reads from that location at t2.
452 # If no normalization genes can pass the cutoff, normalization cannot occur, so this ends the script advises the user to try again and lower cutoff and/or cutoff2.
458 # Prints the number of of blank fitness values found and removed for reference. Writes the percentage to a file so it can be referenced for aggregate analysis.
466 # Finds "average" - the average fitness value for an insertion within the transposon genes - and "weighted_average" - the average fitness value for an insertion within the transposon genes weighted by how many insertions each had.
486 # For example, if the average fitness for genes was too low overall - let's say 0.97 within the normalization geness - every fitness would be proportionally raised.
497 # For example you might multiply all the values by a constant for a genetic interaction screen - where Tn-Seq is performed as usual except there's one background knockout all the mutants share. This is
498 # because independent mutations should have a fitness value that's equal to their individual fitness values multipled, but related mutations will deviate from that; to find those deviations you'd multiply
499 # all the fitness values from mutants from a normal library by the fitness of the background knockout and compare that to the fitness values found from the knockout library!
534 # So what's written here is the WIG header plus each insertion position and it's new w value if normalization were called for, and each insertion position and its unnormalized w value if normalization were not called for.