# HG changeset patch
# User devteam@galaxyproject.org
# Date 1429719568 14400
# Node ID b27006b0a9530887bb31088ff4d0717c6ffc2f44
# Parent ecfc9041bcc55ea012fab7007851b4f5813395a5
update to latest version
diff -r ecfc9041bcc5 -r b27006b0a953 GenotypeTRcorrection.py
--- a/GenotypeTRcorrection.py Wed Apr 01 14:05:54 2015 -0400
+++ b/GenotypeTRcorrection.py Wed Apr 22 12:19:28 2015 -0400
@@ -2,7 +2,6 @@
import sys
import collections, math
import heapq
-from galaxy import eggs
@@ -205,7 +204,7 @@
## scope filter
#########################################
-######## prob calculation option ########
+######## prob calculation sector ########
#########################################
homozygous_collector=0
heterozygous_collector=0
diff -r ecfc9041bcc5 -r b27006b0a953 GenotypingSTR.xml
--- a/GenotypingSTR.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/GenotypingSTR.xml Wed Apr 22 12:19:28 2015 -0400
@@ -1,5 +1,5 @@
-
- during sequencing and library prep
+
+ that occur during sequencing and library prep
GenotypeTRcorrection.py $microsat_raw $microsat_error_profile $microsat_corrected $expectedminorallele
@@ -28,45 +28,45 @@
**What it does**
-- This tool will correct for microsatellite sequencing and library preparation errors using error rates estimated from hemizygous male X chromosome or any rates provided by user. The read profile for each locus will be processed independently.
-- First, this tool will find three most common read lengths from input read length profile. If the read profile has only one length of TR, the length of one motif longer than the observed length will be used as the second most common read length.
-- Second, it will calculate probability of three forms of homozygous and use the form which give the highest probability. The same goes for heterozygous.
-- Third, this tools will calculate log based 10 of (the probability of homozygous/the probability of heterozygous). If this value is more than 0, it will predict this locus to homozygous. If this value is less than 0, it will predict this locus to heterozygous. If this value is 0, read profile at this locus will be discard.
+- This tool will correct for STR sequencing and library preparation errors using error rates estimated from hemizygous male X chromosome (https://usegalaxy.org/u/guru%40psu.edu/h/error-rates-files) or rates provided by the user. The STR length profile for each locus will be processed independently.
+- First, this tool will find the three most common STR lengths in the input STR length profile. If the STR length profile contains only one STR length, the length one motif longer than the observed length will be used as the second most common STR length.
+- Second, it will calculate the probability of three forms of homozygotes and use the form with the highest probability. The same goes for heterozygotes.
+- Third, this tool will calculate log10 of the ratio of the probability of homozygote to the probability of heterozygote. If this value is more than 0, it will predict this locus to be a homozygote. If this value is less than 0, it will predict this locus to be a heterozygote. If this value is 0, the read profile at this locus will be discarded.
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
**Input**
- The input files need to contain at least three columns.
-- Column 1 = location of microsatellite locus.
-- Column 2 = length profile (length of microsatellite in each read that mapped to this location in comma separated format).
-- Column 3 = motif of microsatellite in this locus. The input file can contain more than three column.
+- Column 1 = location of STR locus.
+- Column 2 = length profile (length of STR in each read that mapped to this location in comma separated format).
+- Column 3 = motif of STR in this locus. The input file can contain more than three columns.
**Output**
-The output will be contain original three (or more) column as the input. However, it will also have these following columns.
+The output will contain the original three (or more) columns from the input, followed by these additional columns.
-- Additional column 1 = homozygous/heterozygous label.
-- Additional column 2 = log based 10 of (the probability of homozygous/the probability of heterozygous)
-- Additional column 3 = Allele for most probable homozygous form.
-- Additional column 4 = Allele 1 for most probable heterozygous form.
-- Additional column 5 = Allele 2 for most probable heterozygous form.
+- Additional column 1 = homozygote/heterozygote label.
+- Additional column 2 = log based 10 of (the probability of homozygote/the probability of heterozygote)
+- Additional column 3 = Allele for most probable homozygote.
+- Additional column 4 = Allele 1 for most probable heterozygote.
+- Additional column 5 = Allele 2 for most probable heterozygote.
**Example**
-- Suppose that we sequence one locus of microsatellite with NGS. This locus has **A** motif and the following length (bp) profile. ::
+- Suppose that we sequence a locus of STR with NGS. This locus has **A** motif and the following STR length (bp) profile. ::
chr1_100_106 5, 6, 6, 6, 6, 7, 7, 8, 8 A
-- We want to figure out if this locus is a homolozygous or heterozygous and the corresponding allele(s). Therefore, we use this tool to refine genotype.
-- This tool will calculate the probability of homozygous A6A6, A7A7, and A8A8 to generate observed length profile. Among this A7A7 has the highest probability. Therefore, we use this form as the representative for homozygous.
-- Then, this tool will calculate the probability of heterozygous A6A7, A7A8, and A6A8 to generate observed length profile. Among this A6A8 has the highest probability. Therefore, we use this form as the representative for heterozygous.
-- The A6A7 has higher probability than A7A7. Therefore, the program will report that this locus is a heterozygous locus. ::
+- We want to figure out whether this locus is a homozygote or a heterozygote, and the corresponding allele(s). Therefore, we use this tool to refine the genotype.
+- This tool will calculate the probability that homozygote A6A6, A7A7, or A8A8 generates the observed STR length profile. Among these, A7A7 has the highest probability. Therefore, we use this form as the representative for homozygote.
+- Then, this tool will calculate the probability that heterozygote A6A7, A7A8, or A6A8 generates the observed STR length profile. Among these, A6A8 has the highest probability. Therefore, we use this form as the representative for heterozygote.
+- Finally, it will compare the representative homozygous and heterozygous forms. A6A8 has a higher probability than A7A7. Therefore, the program will report this locus as a heterozygous locus of form A6A8. ::
chr1 5,6,6,6,6,7,7,8,8 A hetero -14.8744881854 7 6 8
-
\ No newline at end of file
+
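The help text above describes two mechanical steps that can be sketched independently of the error model: picking the candidate STR lengths from a length profile, and turning the homozygote and heterozygote probabilities into a call. The following is a minimal illustration only, not the code in GenotypeTRcorrection.py. The function names `top_lengths` and `call_genotype` are hypothetical, and the probabilities are assumed to have been computed elsewhere from the error rates.

```python
import math
from collections import Counter


def top_lengths(profile, motif_len, k=3):
    """Return up to k most common STR lengths in a length profile.

    If only one length is observed, the length one motif longer is
    appended as the second candidate, as the help text describes.
    """
    counts = Counter(profile)
    lengths = [length for length, _ in counts.most_common(k)]
    if len(lengths) == 1:
        lengths.append(lengths[0] + motif_len)
    return lengths


def call_genotype(p_homo, p_hetero):
    """Label a locus from log10(P(homozygote) / P(heterozygote)).

    Returns (label, log_ratio); label is None when the ratio is exactly
    0, in which case the help text says the locus is discarded.
    """
    log_ratio = math.log10(p_homo / p_hetero)
    if log_ratio > 0:
        return "homo", log_ratio
    if log_ratio < 0:
        return "hetero", log_ratio
    return None, log_ratio


# Length profile from the example above, motif A (1 bp).
candidates = top_lengths([5, 6, 6, 6, 6, 7, 7, 8, 8], motif_len=1)
```

For the example profile, the candidates are lengths 6, 7 and 8, which correspond to the A6, A7 and A8 alleles considered in the walkthrough.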
diff -r ecfc9041bcc5 -r b27006b0a953 PEsortedSAM2readprofile.xml
--- a/PEsortedSAM2readprofile.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/PEsortedSAM2readprofile.xml Wed Apr 22 12:19:28 2015 -0400
@@ -1,5 +1,5 @@
-
- from SAM file sorted by readname
+
+ and get the reference STR allele from the reference genome
PEsortedSAM2readprofile.py $flankedbasesSAM $twobitref $maxTRlength $maxoriginalreadlength $output
@@ -30,33 +30,32 @@
**What it does**
-- This tool will take SAM file sorted by read name, remove unpaired reads, report microsatellites sequences in the reference genome that correspond to the space between paired end reads. Coordinate of start and stop for left and right flanking regions of microsatellites and microsatellite itself as inferred from paired end reads will also be reported.
-- These microsatellites in reference can be used to filter out reads that do not contain microsatellites that concur with microsatellites in reference where the reads mapped to.
+- This tool will take a SAM file (sorted by read name), remove unpaired reads, and combine each faux read pair into a single row. It also reports the Short Tandem Repeat (STR) sequences in the reference genome that correspond to the space between the faux paired-end reads, and the start and stop coordinates of the left and right flanking regions of the STRs.
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
**Input**
-- Sorted SAM files by read name
+- Sorted SAM files by read name.
**Output**
-The output will combined two lines of input which are paired. The output format is as follow.
+The output will combine the two faux paired-end read lines of input into the following single-line format:
- Column 1 = read name
- Column 2 = chromosome
- Column 3 = left flanking region start
- Column 4 = left flanking region stop
-- Column 5 = microsatellite start
-- Column 6 = microsatellite stop
+- Column 5 = STR start
+- Column 6 = STR stop
- Column 7 = right flanking region start
- Column 8 = right flanking region stop
-- Column 9 = microsatellite length in reference
-- Column 10= microsatellite sequence in reference
+- Column 9 = STR length in reference
+- Column 10= STR sequence in reference
-
\ No newline at end of file
+
diff -r ecfc9041bcc5 -r b27006b0a953 combineprobforallelecombination.xml
--- a/combineprobforallelecombination.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/combineprobforallelecombination.xml Wed Apr 22 12:19:28 2015 -0400
@@ -1,4 +1,4 @@
-
+
from the same allele combination
combinedprobforallelecombination.py $input > $output
@@ -24,44 +24,44 @@
**What it does**
-- This tool will combine probability that the allele combination can generated any read profile in the input. This is the last step to calculate probability to detect heterozygous for each allele combination and each depth.
+- This tool combines the read profile probabilities for each allele combination in the input and calculates the probability of detecting a heterozygote for each allele combination and each depth.
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
**Input**
The input format is the same as output from **Evaluate the probability of the allele combination to generate read profile** tool.
-- Column 1 = location of microsatellite locus.
-- Column 2 = length profile (length of microsatellite in each read that mapped to this location in comma separated format).
-- Column 3 = motif of microsatellite in this locus. The input file can contain more than three column.
-- Column 4 = homozygous/heterozygous label.
-- Column 5 = log based 10 of (the probability of homozygous/the probability of heterozygous)
-- Column 6 = Allele for most probable homozygous form.
-- Column 7 = Allele 1 for most probable heterozygous form.
-- Column 8 = Allele 2 for most probable heterozygous form.
+- Column 1 = location of STR locus.
+- Column 2 = length profile (length of STR in each read that mapped to this location in comma separated format).
+- Column 3 = motif of STR in this locus. The input file can contain more than three columns.
+- Column 4 = homozygote/heterozygote label.
+- Column 5 = log based 10 of (the probability of homozygote/the probability of heterozygote)
+- Column 6 = Allele for most probable homozygote.
+- Column 7 = Allele 1 for most probable heterozygote.
+- Column 8 = Allele 2 for most probable heterozygote.
- Column 9 = Probability of the allele combination to generate given read profile.
- Column 10 = Number of possible rearrangement of given read profile.
- Column 11 = Probability of the allele combination to generate read profile with any rearrangement (Product of column 9 and column 10)
- Column 12 = Read depth
-Only column 2,3,4,7,8,11 were used in calculation.
+Only columns 2,3,4,7,8,11 were used in calculation.
**Output**
-The output will contain the following header and column
+The output will contain the following header and columns
- Line 1 header: read_depth allele heterozygous_prob motif
- Column 1 = read depth
- Column 2 = allele combination
-- Column 3 = probability to detect heterozygous of that allele combination
+- Column 3 = probability to detect heterozygote of that allele combination
- Column 4 = motif
-
\ No newline at end of file
+
diff -r ecfc9041bcc5 -r b27006b0a953 fetchflank.xml
--- a/fetchflank.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/fetchflank.xml Wed Apr 22 12:19:28 2015 -0400
@@ -1,5 +1,5 @@
-
- of microsatellites and output as two fastq files in forward-forward orientation
+
+ the STRs in the reads and output two fastq files in forward-forward orientation
pair_fetch_DNA_ff.py $microsat_in_read $Leftflanking $Rightflanking $qualitycutoff $lengthofbasetocheckquality
@@ -29,32 +29,30 @@
**What it does**
-This tool will fetch flanking regions around microsatellites, screen for quality score at microsatellites and adjacent flanking regions, and output two fastq files containing flanking regions in forward-forward direction.
+This tool will fetch flanking regions around STRs from the reads output by the "STR detection" step, screen for quality score at STRs and adjacent flanking regions, and output two fastq files containing flanking regions in forward-forward direction.
- This tool assumes that the quality score is Phred+33, such as Sanger fastq.
- Reads that have either left or right flanking regions shorter than the length of flanking regions that require quality screening will be removed.
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
**Input**
-The input files need to be in the same format as output from **microsatellite detection program**. This format contains **length of repeat**, **length of left flanking region**, **length of right flanking region**, **repeat motif**, **hamming (editing) distance**, **read name**, **read sequence**, **read quality score**
+The input file needs to be in the same format as the output from the **STR detection** step. This format contains **length of repeat**, **length of left flanking region**, **length of right flanking region**, **repeat motif**, **hamming (editing) distance**, **read name**, **read sequence**, **read quality score**
**Output**
-The output will be the two fastq files. The first file contains left flank regions. The second file contains right flanking regions.
+The output will be two fastq files. The first file contains left flanking bases. The second file contains right flanking bases.
**Example**
-- Suppose we detected the microsatellites from short reads ::
+- Starting with this test input ::
6 40 54 G 0 SRR345592.75000006 HS2000-192_107:1:63:5822:176818_1_per1_1 TACCCTCCTGTCTTCCCAGACTGATTTCTGTTCCTGCCCTggggggTTCTTGACTCCTCTGAATGGGTACGGGAGTGTGGACCTCAGGGAGGCCCCCTTG GGGGGGGGGGGGGGGGGFGGGGGGGGGFEGGGGGGGGGGG?FFDFGGGGGG?FFFGGGGGDEGGEFFBEFCEEBD@BACB*?=99(/=5'6=4:CCC*AA
-- We want to get fastq files of flanking regions around microsatellite with quality score at least 20 on Phred +33
-
-- Then the program will report these two fastq files ::
+- If we want to get fastq files of flanking regions around the detected STRs with quality score of at least 20, the program will report these two fastq files ::
@SRR345592.75000006 HS2000-192_107:1:63:5822:176818_1_per1_1
TACCCTCCTGTCTTCCCAGACTGATTTCTGTTCCTGCCCT
@@ -70,4 +68,4 @@
-
\ No newline at end of file
+
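The fetchflank help text above describes splitting each read into left and right flanking bases after a Phred+33 quality screen. The sketch below shows one way such a screen could work. It is an assumption-laden illustration, not the logic of pair_fetch_DNA_ff.py: the function names are hypothetical, the screened window is assumed to be the STR plus `n_check` bases on each side, and the forward-forward orientation handling is left out.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]


def split_flanks(str_len, left_len, right_len, seq, qual, cutoff, n_check):
    """Return ((left_seq, left_qual), (right_seq, right_qual)) or None.

    A read is dropped (None) when either flank is shorter than n_check,
    or when any base in the STR or in the n_check bases adjacent to it
    on either side has quality below cutoff. The window is an assumption.
    """
    if left_len < n_check or right_len < n_check:
        return None
    start = left_len - n_check
    stop = left_len + str_len + n_check
    if min(phred33(qual[start:stop])) < cutoff:
        return None
    right_start = left_len + str_len
    right_stop = right_start + right_len
    left = (seq[:left_len], qual[:left_len])
    right = (seq[right_start:right_stop], qual[right_start:right_stop])
    return left, right


# Synthetic read: 10 bp left flank, 6 bp STR, 12 bp right flank.
seq = "A" * 10 + "g" * 6 + "C" * 12
qual = "I" * 28  # 'I' is Phred 40 in Phred+33 encoding
flanks = split_flanks(6, 10, 12, seq, qual, cutoff=20, n_check=5)
```

The first three arguments mirror the first three columns of the STR detection output (repeat length, left flank length, right flank length), so a row of that file can be passed through almost directly.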
diff -r ecfc9041bcc5 -r b27006b0a953 microsatcompat.xml
--- a/microsatcompat.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/microsatcompat.xml Wed Apr 22 12:19:28 2015 -0400
@@ -1,4 +1,4 @@
-
+
microsatcompat.py $input $column1 $column2 > $output
@@ -28,44 +28,44 @@
**What it does**
-This tool is used to select only the input lines which have compatible microsatellite motifs between two columns. Compatible here is defined as the microsatellites motif that are complementary or have the same sequence when change starting point of motif. For example, **A** is the same as **T**. Also, **AGG** is the same as **GAG**.
+This tool is used to select only those input lines that have compatible STR motifs between the two user-specified columns. Two STR motifs are called compatible if they are identical, complementary, or produce the same sequence on rotating the start of the motif. For example, **A** is considered compatible with **A** and its reverse complement **T**. Similarly, **AGG** is considered compatible with **AGG**, its reverse complement **CCT**, and their rotations **GGA**, **GAG**, **TCC** and **CTC**.
-For TRFM pipeline (profiling microsatellites in short read data), this tool can be used to make sure that the microsatellites in the reads have the same motif as the microsatellites in the reference at the corresponding mapped location.
+For the STR-FM pipeline (profiling STRs in short read data), this tool can be used to make sure that the STRs in the reads have a motif compatible with that of the STRs in the reference at the corresponding mapped location.
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
**Input**
The input files can be any tab delimited file.
-If this tool is used in TRFM microsatellite profiling, it should contains:
+If this tool is used in the STR-FM pipeline for STR profiling, the input should contain:
-- Column 1 = microsatellite location in reference chromosome
-- Column 2 = microsatellite location in reference start
-- Column 3 = microsatellite location in reference stop
-- Column 4 = microsatellite location in reference motif
-- Column 5 = microsatellite location in reference length
-- Column 6 = microsatellite location in reference motif size
-- Column 7 = length of microsatellites (bp)
-- Column 8 = length of left flanking regions (bp)
-- Column 9 = length of right flanking regions (bp)
+- Column 1 = STR location in reference chromosome
+- Column 2 = STR location in reference start
+- Column 3 = STR location in reference stop
+- Column 4 = STR location in reference motif
+- Column 5 = STR location in reference length
+- Column 6 = STR location in reference motif size
+- Column 7 = length of STR (bp)
+- Column 8 = length of left flanking region (bp)
+- Column 9 = length of right flanking region (bp)
- Column 10 = repeat motif (bp)
- Column 11 = hamming distance
- Column 12 = read name
-- Column 13 = read sequence with soft masking of microsatellites
+- Column 13 = read sequence with soft masking of STR
- Column 14 = read quality (the same Phred score scale as input)
- Column 15 = read name (The same as column 12)
- Column 16 = chromosome
- Column 17 = left flanking region start
- Column 18 = left flanking region stop
-- Column 19 = microsatellite start as infer from pair-end
-- Column 20 = microsatellite stop as infer from pair-end
+- Column 19 = STR start as inferred from paired-end reads
+- Column 20 = STR stop as inferred from paired-end reads
- Column 21 = right flanking region start
- Column 22 = right flanking region stop
-- Column 23 = microsatellite length in reference
-- Column 24 = microsatellite sequence in reference
+- Column 23 = STR length in reference
+- Column 24 = STR sequence in reference
**Output**
@@ -73,4 +73,4 @@
-
\ No newline at end of file
+
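The compatibility rule described in the microsatcompat help text (identical, reverse complement, or a rotation of either) is small enough to sketch directly. This is an illustrative reimplementation under that stated definition, not the code in microsatcompat.py, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Reverse complement of an uppercase DNA motif."""
    return motif.translate(_COMPLEMENT)[::-1]


def rotations(motif):
    """All cyclic rotations of a motif, e.g. AGG -> {AGG, GGA, GAG}."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def compatible(motif_a, motif_b):
    """True if motif_b is a rotation of motif_a or of its reverse complement."""
    a, b = motif_a.upper(), motif_b.upper()
    if len(a) != len(b):
        return False
    return b in rotations(a) | rotations(reverse_complement(a))


agg_family = rotations("AGG") | rotations(reverse_complement("AGG"))
```

For **AGG**, the compatible set is AGG, GGA, GAG, CCT, CTC and TCC. Comparing the motif in a read against the motif at the mapped reference location then reduces to one `compatible(read_motif, ref_motif)` call per input line.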
diff -r ecfc9041bcc5 -r b27006b0a953 microsatellite.xml
--- a/microsatellite.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/microsatellite.xml Wed Apr 22 12:19:28 2015 -0400
@@ -1,4 +1,4 @@
-
+
for short read, reference, and mapped data
microsatellite.py
"${filePath}"
@@ -108,30 +108,13 @@
**What it does**
-We use different algorithms to detect microsatellites depend on hamming distance parameter.
-If hamming distance is set to zero, the program will only concern about uninterrupted microsatellites. The process works as follows.
-
-1) Scanning reads using sliding windows. For a given repeat period ‘k’ (e.g. k=2 for dinucleotide TRs), we compared consecutive k-mer window size sequences, with a step size of k. If a base at a given position matches one k positions earlier it was marked with a plus, if corresponding sites had different bases it was marked with a minus. The first k position is blank.
-
-2) Since we do not allow mutations in reported TR, consecutive “+” signal sequence means that a k-mer TR is present in this sample.
-
-3) Report k-mer TRs if the length is larger than a threshold provided by the user.
-
-If hamming distance is set to integer more than zero, the program will concern both uninterrupted and interrupted microsatellites. The process works as follows:
-
-(1) Identify intervals that are highly correlated with the interval shifted by ‘k’ (the repeat period). These intervals are called "runs" or "candidates". The allowed level of correlation is 6/7. Depending on whether we want to look for more than one microsat, we either find the longest such run (simple algorithm) or many runs (more complicated algorithm). The following steps are then performed on each run.
-
-(2) Find the most likely repeat motif in the run. This is done by counting all kmers (of length P) and choosing the most frequent. If that kmer is itself covered by a sub-repeat we discard this run. The idea is that we can ignore a 6-mer like ACGACG because we will find it when we are looking for 3-mers.
-
-(3) Once we identify the most likely repeat motif, we then modify the interval, adjusting start and end to find the interval that has the fewest mismatches vs. a sequence of the motif repeated (hamming distance).
-
-(4) At this point we have a valid microsat interval (in the eyes of the program). It is subjected to some filtering stages (hamming distance or too close to an end), and if it satisfies those conditions, it's reported to the user
-
-For more option, the script to run this program can be downloaded and run with python independently from Galaxy. There are more option for the script mode. Help page is build-in inside the script.
+This tool identifies simple as well as interrupted STRs. Choosing a hamming distance of zero will return simple STRs.
+Choosing a hamming distance greater than zero will return both simple and interrupted STRs.
+The algorithms used to identify simple and interrupted STRs are described in the manuscript cited below (see TABLE XXXX).
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
This tool is developed by Chen Sun (cxs1031@cse.psu.edu) and Bob Harris (rsharris@bx.psu.edu)
**Input**
@@ -142,37 +125,37 @@
For fastq, the output will contain the following columns:
-- Column 1 = length of microsatellites (bp)
-- Column 2 = length of left flanking regions (bp)
-- Column 3 = length of right flanking regions (bp)
+- Column 1 = length of STR (bp)
+- Column 2 = length of left flanking region (bp)
+- Column 3 = length of right flanking region (bp)
- Column 4 = repeat motif (bp)
- Column 5 = hamming distance
- Column 6 = read name
-- Column 7 = read sequence with soft masking of microsatellites
+- Column 7 = read sequence with soft masking of STR
- Column 8 = read quality (the same Phred score scale as input)
For fasta, fastq without quality score and sam format, column 8 will be replaced with dot(.).
-If the users have mapped file (SAM) and would like to profile microsatellites from premapped data instead of using flank-based mapping approach, they can select SAM format input and specify that they want correspond microsatellites in reference for comparison. The output will be as follow:
+If users have a mapped file (SAM) and would like to profile STRs from premapped data instead of using the flank-based mapping approach, they can select SAM format input and specify that they want the corresponding STRs in the reference for comparison. The output will be as follows:
-- Column 1 = length of microsatellites (bp)
-- Column 2 = length of left flanking regions (bp)
-- Column 3 = length of right flanking regions (bp)
+- Column 1 = length of STR (bp)
+- Column 2 = length of left flanking region (bp)
+- Column 3 = length of right flanking region (bp)
- Column 4 = repeat motif (bp)
- Column 5 = hamming distance
- Column 6 = read name
-- Column 7 = read sequence with soft masking of microsatellites
+- Column 7 = read sequence with soft masking of STR
- Column 8 = read quality (the same Phred score scale as input)
- Column 9 = read name (The same as column 6)
- Column 10 = chromosome
- Column 11 = left flanking region start
- Column 12 = left flanking region stop
-- Column 13 = microsatellite start as infer from pair-end
-- Column 14 = microsatellite stop as infer from pair-end
+- Column 13 = STR start as inferred from paired-end reads
+- Column 14 = STR stop as inferred from paired-end reads
- Column 15 = right flanking region start
- Column 16 = right flanking region stop
-- Column 17 = microsatellite length in reference
-- Column 18 = microsatellite sequence in reference
+- Column 17 = STR length in reference
+- Column 18 = STR sequence in reference
diff -r ecfc9041bcc5 -r b27006b0a953 microsatpurity.xml
--- a/microsatpurity.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/microsatpurity.xml Wed Apr 22 12:19:28 2015 -0400
@@ -1,4 +1,4 @@
-
+
of a specific column
microsatpurity.py $input $period $column_n > $output
@@ -28,47 +28,47 @@
**What it does**
-This tool is used to select only the uninterrupted microsatellites. Interrupted microsatellites (e.g. ATATATATAATATAT) or sequences of microsatellites with non-microsatellite parts (e.g. ATATATATATG) will be removed.
+This tool is used to select only the uninterrupted STRs/microsatellites. Interrupted STRs (e.g. ATATATATAATATAT) or sequences of STRs with non-STR parts (e.g. ATATATATATG) will be removed.
-For TRFM pipeline (profiling microsatellites in short read data), this tool can be used to avoid the cases that flanking bases were misread as microsatellite. Thus, the read profile will only reflect the variation of TR length from expansion/contraction.
-For example, suppose that the sequence around microsatellite is AGCGACGaaaaaaGCGATCA. If we observe read with sequence AGCGACGaaaaaaaaaaGCGATCA, we can indicate that this is microsatellite expansion. However, if we observe AGCGACGaaaaaaaCGATCA, this is more like a substitution of G to A. These incidents can be removed with this tool.
-You can use the tool **combine mapped flaked bases** to get the microsatellites in reference that correspond to sequence between mapped reads. If the user map these reads around the uninterrupted microsatelites in reference, the corresponding sequences between these pairs should be the uninterrupted microsatellites regardless of expansion/contraction of microsatellites in short read data. However, if the substitution of flanking base or if the fluorescent signal from the previous run make it look like substitution, the corresponding sequences in reference in between the pairs will not be uninterrupted microsatellites. Thus this tool can remove those cases and keep only microsatellite expansion/contraction.
+As another application of this tool, specifically for the STR-FM pipeline (profiling STRs in short read data), it can be used to avoid cases where flanking bases were misread as STRs (sequencing errors). Thus, the remaining read profile will only reflect the variation of STR length from expansion/contraction.
+For example, suppose that the sequence around an STR in the reference genome is AGCGACGaaaaaaGCGATCA. If we observe a read with sequence AGCGACGaaaaaaaaaaGCGATCA, we can infer that this is an STR expansion. However, if we observe another read with sequence AGCGACGaaaaaaaCGATCA, this is more likely a substitution of G to A. Such incidents can be removed with this tool.
+You can use the tool **combine mapped flanking bases** to get the STRs in the reference that correspond to the sequence between mapped reads. If the user maps these reads around the uninterrupted STRs in the reference, the corresponding sequences between these pairs should be uninterrupted STRs regardless of expansion/contraction of STRs in the short read data. However, if a flanking base is substituted, or if the fluorescent signal from the previous run makes it look like a substitution, the corresponding sequences in the reference between the pairs will not be uninterrupted STRs. Thus this tool can remove those cases and keep only STR expansion/contraction.
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
**Input**
The input files can be any tab delimited file.
-If this tool is used in TRFM microsatellite profiling, it should contains:
+If this tool is used in the STR-FM pipeline for STR profiling, the input should contain:
-- Column 1 = microsatellite location in reference chromosome
-- Column 2 = microsatellite location in reference start
-- Column 3 = microsatellite location in reference stop
-- Column 4 = microsatellite location in reference motif
-- Column 5 = microsatellite location in reference length
-- Column 6 = microsatellite location in reference motif size
-- Column 7 = length of microsatellites (bp)
-- Column 8 = length of left flanking regions (bp)
-- Column 9 = length of right flanking regions (bp)
+- Column 1 = STR location in reference chromosome
+- Column 2 = STR location in reference start
+- Column 3 = STR location in reference stop
+- Column 4 = STR location in reference motif
+- Column 5 = STR location in reference length
+- Column 6 = STR location in reference motif size
+- Column 7 = length of STR (bp)
+- Column 8 = length of left flanking region (bp)
+- Column 9 = length of right flanking region (bp)
- Column 10 = repeat motif (bp)
- Column 11 = hamming distance
- Column 12 = read name
-- Column 13 = read sequence with soft masking of microsatellites
+- Column 13 = read sequence with soft masking of STR
- Column 14 = read quality (the same Phred score scale as input)
- Column 15 = read name (The same as column 12)
- Column 16 = chromosome
- Column 17 = left flanking region start
- Column 18 = left flanking region stop
-- Column 19 = microsatellite start as infer from pair-end
-- Column 20 = microsatellite stop as infer from pair-end
+- Column 19 = STR start as inferred from paired-end reads
+- Column 20 = STR stop as inferred from paired-end reads
- Column 21 = right flanking region start
- Column 22 = right flanking region stop
-- Column 23 = microsatellite length in reference
-- Column 24 = microsatellite sequence in reference
+- Column 23 = STR length in reference
+- Column 24 = STR sequence in reference
**Output**
@@ -76,4 +76,4 @@
-
\ No newline at end of file
+
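The purity filter described in the microsatpurity help text keeps a sequence only if it is an uninterrupted repeat of the given period. A minimal sketch of that check follows. It is an illustration of the stated rule, not the code in microsatpurity.py, and `is_uninterrupted` is a hypothetical name.

```python
def is_uninterrupted(seq, period):
    """True if every base equals the base `period` positions earlier.

    This accepts partial trailing repeats (e.g. ATATA with period 2)
    and rejects interruptions or non-STR bases anywhere in the string.
    """
    seq = seq.upper()
    if period <= 0 or len(seq) < period:
        return False
    return all(seq[i] == seq[i - period] for i in range(period, len(seq)))


# The two rejected examples from the help text, plus a pure repeat.
results = [
    is_uninterrupted("ATATATATAATATAT", 2),  # interrupted STR
    is_uninterrupted("ATATATATATG", 2),      # STR with a non-STR base
    is_uninterrupted("ATATATAT", 2),         # uninterrupted STR
]
```

Applied to the reference sequence column of the pipeline's tab-delimited input, this check would drop lines where a flanking-base substitution has broken the repeat.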
diff -r ecfc9041bcc5 -r b27006b0a953 probvalueforhetero.xml
--- a/probvalueforhetero.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/probvalueforhetero.xml Wed Apr 22 12:19:28 2015 -0400
@@ -28,34 +28,34 @@
**What it does**
-- This tool will calculate the probability that the allele combination can generated the given read profile. This tool is part of the pipeline to estimate minimum read depth.
-- The calculation of probability is very similar to the tool **Correct genotype for microsatellite errors**. However, this tool will restrict the calculation to only the allele combination indicated in input. Also, when it encounter allele combination that cannot be generated from error profile, the total probability will be zero instead of using base substitution rate.
+- This tool will calculate the probability that the allele combination can generate the given STR length profile. This tool is part of the pipeline to estimate minimum read depth.
+- The calculation of probability is very similar to the tool **Correct genotype for STR errors**. However, this tool restricts the calculation to only the allele combination indicated in the input. Also, when it encounters an allele combination that cannot be generated from the error profile, the total probability will be zero instead of using the base substitution rate.
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
**Input**
-The input format is the same as output from **Correct genotype for microsatellite errors** tool.
+The input format is the same as output from **Correct genotype for STR errors** tool.
-- Column 1 = location of microsatellite locus.
-- Column 2 = length profile (length of microsatellite in each read that mapped to this location in comma separated format).
-- Column 3 = motif of microsatellite in this locus. The input file can contain more than three column.
-- Column 4 = homozygous/heterozygous label.
-- Column 5 = log based 10 of (the probability of homozygous/the probability of heterozygous)
-- Column 6 = Allele for most probable homozygous form.
-- Column 7 = Allele 1 for most probable heterozygous form.
-- Column 8 = Allele 2 for most probable heterozygous form.
+- Column 1 = location of STR locus.
+- Column 2 = length profile (length of STR in each read that mapped to this location in comma separated format).
+- Column 3 = motif of STR in this locus. The input file can contain more than three columns.
+- Column 4 = homozygote/heterozygote label.
+- Column 5 = log base 10 of (the probability of homozygote/the probability of heterozygote)
+- Column 6 = Allele for most probable homozygote.
+- Column 7 = Allele 1 for most probable heterozygote.
+- Column 8 = Allele 2 for most probable heterozygote.
Only columns 2, 3, 7 and 8 are used in the calculation.
**Output**
-The output will be contain original eight column from the input. However, it will also add these following columns.
+The output will contain the original eight columns from the input and the following additional columns.
- Column 9 = Probability of the allele combination to generate given read profile.
-- Column 10 = Number of possible rearrangement of given read profile.
+- Column 10 = Number of possible rearrangements of the given read profile.
- Column 11 = Probability of the allele combination to generate read profile with any rearrangement (Product of column 9 and column 10)
- Column 12 = Read depth
@@ -63,4 +63,4 @@
-
\ No newline at end of file
+
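The relationship between output columns 9 to 12 of probvalueforhetero can be sketched as follows. This is only an illustration, not the tool's code: it assumes that the "number of possible rearrangements" is the multinomial coefficient of the length profile, and the function names `count_rearrangements` and `summarize` are hypothetical.

```python
from collections import Counter
from math import factorial


def count_rearrangements(length_profile):
    """Number of distinct orderings of the reads in a length profile.

    Assumption: this is the multinomial coefficient n! / (c1! * c2! * ...),
    where each c is the number of reads sharing the same STR length.
    """
    total = factorial(len(length_profile))
    for count in Counter(length_profile).values():
        total //= factorial(count)
    return total


def summarize(length_profile_text, prob_single_order):
    """Return hypothetical values for columns 9, 10, 11 and 12.

    length_profile_text is column 2 of the input (comma separated lengths);
    prob_single_order stands in for the probability computed by the tool.
    """
    profile = [int(x) for x in length_profile_text.split(",")]
    n_orders = count_rearrangements(profile)
    return (prob_single_order,             # column 9
            n_orders,                      # column 10
            prob_single_order * n_orders,  # column 11 = column 9 * column 10
            len(profile))                  # column 12 = read depth
```

For a profile such as `9,9,10` there are three distinct orderings, so column 11 would be three times column 9 under this reading.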
diff -r ecfc9041bcc5 -r b27006b0a953 profilegenerator.xml
--- a/profilegenerator.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/profilegenerator.xml Wed Apr 22 12:19:28 2015 -0400
@@ -1,4 +1,4 @@
-
+
of the consecutive allele from given error profile
profilegenerator.py $error_profile $MOTIF $Maxdepth $minprob > $output
@@ -18,7 +18,7 @@
-
+
@@ -30,32 +30,33 @@
**What it does**
-This tool will generate all possible combination of observed read profile of the consecutive alleles from given error profile. The range of observed read length can be filtered to contain only those that are frequently occur using "Minimum error rate to be considered" parameter.
+This tool will generate all possible combinations of observed STR length profiles of the consecutive alleles from a given error profile. The range of observed read lengths can be filtered to contain only those that occur frequently using the "Minimum error rate to be considered" parameter.
-This problem will collect the lists of valid (pass "Minimum error rate to be considered" threshold) observed length profiles from combination of consecutive allele lengths. The lists that are equivalent or the subset of the other lists will be removed. For each depth and each list, length profile were generated from combination with replacement which compatible with python 2.7. There could be redundant error profiles generated from different lists if more than one combination of allele is generated due to overlap range of observed microsatellite lengths. The user need to remove them which can be done easily using **sort | uniq** command in unix.
+This program will collect the lists of valid (passing the "Minimum error rate to be considered" threshold) observed length profiles from combinations of consecutive allele lengths. Lists that are equivalent to, or a subset of, other lists will be removed. For each depth and each list, length profiles are generated using combinations with replacement, which is compatible with Python 2.7. Redundant length profiles may be generated from different lists when the ranges of observed STR lengths overlap between allele combinations. The user needs to remove them, which can be done easily with the **sort | uniq** command in Unix.
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
**Input**
- The error profile needs to contain these three columns.
-- Column 1 = Correct microsatellite length
-- Column 2 = Observed microsatellite length
+- Column 1 = Correct STR length
+- Column 2 = Observed STR length
- Column 3 = Number of observation
**Output**
-- Column 1 = Place holder for location of microsatellite locus. (just "chr")
-- Column 2 = length profile (length of microsatellite in each read that mapped to this location in comma separated format).
-- Column 3 = motif of microsatellite in this locus.
+- Column 1 = Place holder for location of STR locus. (just "chr")
+- Column 2 = length profile (length of STR in each read that mapped to this location in comma separated format).
+- Column 3 = motif of STR in this locus.
**Example**
-- Suppose that we provide the following read profile ::
+- Suppose that we provide the following STR length profile ::
+ true obs. reads
9 9 100000
10 10 91456
10 9 1259
@@ -64,16 +65,15 @@
11 12 514
-- Using default minimum probability to be consider and motif = A, all observed read lengths are valid. The program will generated lists of observed length profiles from consecutive allele length. ::
+- Using the default minimum probability (fraction of reads) of 0.00000001 and motif = A, all observed STR lengths are valid. The program will generate lists of observed length profiles from consecutive allele lengths ::
9:10 = [9,10]
10:11 = [9,10,11,12]
-- Lists that are subsets of other lists will be removed. Thus, [9,10] will not be considered.
+- Lists that are subsets of other lists will be removed. In this example, [9,10] will not be considered.
-- Then the program will generate all combination with replacement for each depth from each list. Using **maximum read depth =3**, we will ge the following output. ::
+- The program will then generate all combinations with replacement for each depth from each list. Using **maximum read depth levels =3**, we will get the following output. ::
-
chr 9,9 A
chr 9,10 A
chr 9,11 A
@@ -107,4 +107,4 @@
-
\ No newline at end of file
+
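The subset removal and combination steps described in the profilegenerator help text can be sketched as below. This is a minimal illustration under stated assumptions, not the contents of profilegenerator.py: the function names and the `min_depth` parameter are hypothetical, and the input lists are assumed to have already passed the minimum probability filter.

```python
from itertools import combinations_with_replacement


def drop_subset_lists(lists):
    """Remove lists that duplicate, or are a proper subset of, another list."""
    unique = []
    for lengths in lists:
        as_set = set(lengths)
        if as_set not in unique:
            unique.append(as_set)
    return [sorted(s) for s in unique
            if not any(s < other for other in unique)]


def generate_profiles(lists, motif, max_depth, min_depth=1):
    """Build tab separated rows: placeholder "chr", length profile, motif."""
    rows = []
    for lengths in drop_subset_lists(lists):
        for depth in range(min_depth, max_depth + 1):
            for combo in combinations_with_replacement(lengths, depth):
                profile = ",".join(str(length) for length in combo)
                rows.append("chr\t%s\t%s" % (profile, motif))
    return rows


if __name__ == "__main__":
    # Lists from the help text example: 9:10 and 10:11.
    for row in generate_profiles([[9, 10], [9, 10, 11, 12]], "A", 3, min_depth=2):
        print(row)
```

Starting from `min_depth=2` is a choice made here only because the listing in the help text begins with two-read profiles; whether the tool also emits single-read profiles is not stated.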
diff -r ecfc9041bcc5 -r b27006b0a953 readdepth2sequencingdepth.xml
--- a/readdepth2sequencingdepth.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/readdepth2sequencingdepth.xml Wed Apr 22 12:19:28 2015 -0400
@@ -32,26 +32,27 @@
**What it does**
-This tool is used to convert informative read depth (specified by user) to sequencing depth when the microsatellites is mapped using TRFM pipeline.
-The locus specific sequencing depth is the sequencing depth that will make a certain loci have certain read depth based on uniform mapped of read. It is calculated as: ::
+This tool is used to convert informative read depth (specified by user) to sequencing depth when STRs are mapped using the STR-FM pipeline.
+The locus specific sequencing depth (yrequired) is the sequencing depth that gives an STR locus a certain informative read depth, assuming uniform mapping of reads. It is calculated as follows: ::
yrequired = ( X * L ) / (L - (2F+r-1))
-Where X = read depth, L = read length, F = the number of flanked bases required on each flanking regions, r = the expected repeat length of microsatellite of interest.
+where X = informative read depth, L = read length, F = the number of flanking bases required on either side, r = the expected repeat length of the STR of interest.
The genome wide sequencing depth is the sequencing depth that will make certain percentage of genome (e.g. 90 percent or 95 percent) to have certain locus specific sequencing depth. It's calculated using numerical guessing to find smallest lambda that: ::
0.90 (or other proportion specified by user) < = P(Y=0) + P(Y=1) + …+ P(Y=yrequired-1)
- P(Y=y) = (lambda^(y) * e ^(-lambda)) /y!
-
+ where P(Y=y) = (lambda^(y) * e ^(-lambda)) /y!
+
y = specific level of sequencing depth. Lambda = genome wide sequencing depth
-
+
+ Please refer to the Methods section of the paper cited below for further details.
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
-
\ No newline at end of file
+
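The two conversions described in the readdepth2sequencingdepth help text can be sketched as below. This is an illustration only, not the tool's implementation. It assumes a simple fixed-step search for lambda, and it reads the prose definition as requiring that at least the target proportion of the genome reaches yrequired, that is P(Y <= yrequired-1) <= 1 - proportion; the help text writes its inequality in terms of that same lower tail. All function names and the `step` parameter are hypothetical.

```python
from math import ceil, exp


def locus_specific_depth(x, read_length, flank, repeat_length):
    """yrequired = (X * L) / (L - (2F + r - 1)), as given in the help text."""
    usable = read_length - (2 * flank + repeat_length - 1)
    if usable <= 0:
        raise ValueError("read too short to span the STR plus both flanks")
    return x * read_length / usable


def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam), summed term by term."""
    if k < 0:
        return 0.0
    term = exp(-lam)          # P(Y = 0)
    total = term
    for y in range(1, k + 1):
        term *= lam / y       # P(Y = y) from P(Y = y - 1)
        total += term
    return total


def genome_wide_depth(y_required, proportion=0.90, step=0.01):
    """Smallest lambda (on a grid of size step) such that at least
    `proportion` of the genome has depth >= y_required (assumed reading)."""
    y = int(ceil(y_required))
    n = 1
    while poisson_cdf(y - 1, n * step) > 1.0 - proportion:
        n += 1
    return n * step


if __name__ == "__main__":
    y_req = locus_specific_depth(x=5, read_length=100, flank=20, repeat_length=10)
    print(y_req, genome_wide_depth(y_req, proportion=0.90))
```

The fixed-step loop is chosen for readability; a bisection search would reach the same grid point faster.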
diff -r ecfc9041bcc5 -r b27006b0a953 space2underscore_readname.xml
--- a/space2underscore_readname.xml Wed Apr 01 14:05:54 2015 -0400
+++ b/space2underscore_readname.xml Wed Apr 22 12:19:28 2015 -0400
@@ -1,5 +1,5 @@
- --change space to underscore of a specific column
+ --change space to underscore in the read name column
changespacetounderscore_readname.py $input $output $column_n
@@ -26,17 +26,18 @@
**What it does**
-This tool is used to change space to underscore. For TRFM pipeline (profiling microsatellites in short read data), this tool is used to change space in read name to underscore to prevent the downstream tools which might recognize incorrect column number due to space in read name. If the input do not have space in read name, this step can be skipped.
+The read name produced by the "STR detection" step may contain spaces instead of underscores, which will cause downstream tools that use space as a column delimiter to fail. This tool converts spaces to underscores.
+If your input does not have spaces in the read name column, this step can be skipped.
**Citation**
-When you use this tool, please cite **Arkarachai Fungtammasan and Guruprasad Ananda (2014).**
+When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
**Input**
The input files can be any tab delimited file.
-If this tool is used in TRFM microsatellite profiling, it should be in the same format as output from **microsatellite detection program**. This format contains **length of repeat**, **length of left flanking region**, **length of right flanking region**, **repeat motif**, **hamming (editing) distance**, **read name**, **read sequence**, **read quality score**
+If this tool is used in STR-FM for STR profiling, the input should be in the same format as the output from the **STR detection program**. This format contains **length of repeat**, **length of left flanking region**, **length of right flanking region**, **repeat motif**, **hamming (editing) distance**, **read name**, **read sequence**, **read quality score**
**Output**
@@ -44,4 +45,4 @@
-
\ No newline at end of file
+
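The conversion performed by space2underscore_readname can be sketched as below. This is a minimal illustration, not the contents of changespacetounderscore_readname.py: the function name is hypothetical, and the column number is assumed to be 1-based.

```python
def space_to_underscore(line, column_n):
    """Replace spaces with underscores in one column of a tab delimited line.

    column_n is assumed to be 1-based; lines with fewer columns are
    returned unchanged (apart from the stripped newline).
    """
    fields = line.rstrip("\n").split("\t")
    index = column_n - 1
    if 0 <= index < len(fields):
        fields[index] = fields[index].replace(" ", "_")
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical row with a read name containing spaces in column 3.
    print(space_to_underscore("12\t5\tread 1 extra\tACGT\n", 3))
```

Only the selected column is modified, so spaces elsewhere in the line are left as they are.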