Mercurial > repos > arkarachai-fungtammasan > str_fm
changeset 2:d5ed5c2e25c3 draft
Uploaded
author | arkarachai-fungtammasan |
---|---|
date | Wed, 22 Apr 2015 12:48:40 -0400 |
parents | f2bab38e3cbd |
children | 3d58c22ea6c9 |
files | GenotypeTRcorrection.py GenotypingSTR.xml PEsortedSAM2readprofile.xml README.md combineprobforallelecombination.xml commandline_sample_STR-FM_estimate_mininum_informative_Read_Depth commandline_sample_STR-FM_reference_profiling commandline_sample_STR-FM_shortread_profiling fetchflank.xml microsatcompat.xml microsatellite.xml microsatpurity.xml probvalueforhetero.xml profilegenerator.xml readdepth2sequencingdepth.xml space2underscore_readname.xml |
diffstat | 16 files changed, 431 insertions(+), 163 deletions(-) [+] |
line wrap: on
line diff
--- a/GenotypeTRcorrection.py Wed Apr 01 17:06:29 2015 -0400 +++ b/GenotypeTRcorrection.py Wed Apr 22 12:48:40 2015 -0400 @@ -2,7 +2,6 @@ import sys import collections, math import heapq -from galaxy import eggs @@ -205,7 +204,7 @@ ## scope filter ######################################### -######## prob calculation option ######## +######## prob calculation sector ######## ######################################### homozygous_collector=0 heterozygous_collector=0
--- a/GenotypingSTR.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/GenotypingSTR.xml Wed Apr 22 12:48:40 2015 -0400 @@ -1,5 +1,5 @@ -<tool id="GenotypeSTR" name="Correct genotype for microsatellite errors" version="2.0.0"> - <description> during sequencing and library prep </description> +<tool id="GenotypeSTR" name="Correct genotype for STR errors" version="2.0.0"> + <description> that occur during sequencing and library prep </description> <command interpreter="python2.7">GenotypeTRcorrection.py $microsat_raw $microsat_error_profile $microsat_corrected $expectedminorallele </command> <inputs> @@ -28,10 +28,10 @@ **What it does** -- This tool will correct for microsatellite sequencing and library preparation errors using error rates estimated from hemizygous male X chromosome or any rates provided by user. The read profile for each locus will be processed independently. -- First, this tool will find three most common read lengths from input read length profile. If the read profile has only one length of TR, the length of one motif longer than the observed length will be used as the second most common read length. -- Second, it will calculate probability of three forms of homozygous and use the form which give the highest probability. The same goes for heterozygous. -- Third, this tools will calculate log based 10 of (the probability of homozygous/the probability of heterozygous). If this value is more than 0, it will predict this locus to homozygous. If this value is less than 0, it will predict this locus to heterozygous. If this value is 0, read profile at this locus will be discard. +- This tool will correct for STR sequencing and library preparation errors using error rates estimated from hemizygous male X chromosome (https://usegalaxy.org/u/guru%40psu.edu/h/error-rates-files) or rates provided by user. The STR length profile for each locus will be processed independently. +- First, this tool will find three most common STR lengths from input STR length profile. If the STR length profile has only one length of STR, the length of one motif longer than the observed length will be used as the second most common STR length. +- Second, it will calculate probability of three forms of homozygotes and use the form with the highest probability. The same goes for heterozygotes. +- Third, this tools will calculate log10 of the ratio of the probability of homozygote to the probability of heterozygote. If this value is more than 0, it will predict this locus to be homozygote. If this value is less than 0, it will predict this locus to be heterozygote. If this value is 0, read profile at this locus will be discarded. **Citation** @@ -40,30 +40,30 @@ **Input** - The input files need to contain at least three columns. -- Column 1 = location of microsatellite locus. -- Column 2 = length profile (length of microsatellite in each read that mapped to this location in comma separated format). -- Column 3 = motif of microsatellite in this locus. The input file can contain more than three column. +- Column 1 = location of STR locus. +- Column 2 = length profile (length of STR in each read that mapped to this location in comma separated format). +- Column 3 = motif of STR in this locus. The input file can contain more than three columns. **Output** -The output will be contain original three (or more) column as the input. However, it will also have these following columns. +The output will be contain original three (or more) columns as the input. However, it will also have these following columns. -- Additional column 1 = homozygous/heterozygous label. -- Additional column 2 = log based 10 of (the probability of homozygous/the probability of heterozygous) -- Additional column 3 = Allele for most probable homozygous form. -- Additional column 4 = Allele 1 for most probable heterozygous form. -- Additional column 5 = Allele 2 for most probable heterozygous form. +- Additional column 1 = homozygote/heterozygote label. +- Additional column 2 = log based 10 of (the probability of homozygote/the probability of heterozygote) +- Additional column 3 = Allele for most probable homozygote. +- Additional column 4 = Allele 1 for most probable heterozygote. +- Additional column 5 = Allele 2 for most probable heterozygote. **Example** -- Suppose that we sequence one locus of microsatellite with NGS. This locus has **A** motif and the following length (bp) profile. :: +- Suppose that we sequence a locus of STR with NGS. This locus has **A** motif and the following STR length (bp) profile. :: chr1_100_106 5, 6, 6, 6, 6, 7, 7, 8, 8 A -- We want to figure out if this locus is a homolozygous or heterozygous and the corresponding allele(s). Therefore, we use this tool to refine genotype. -- This tool will calculate the probability of homozygous A6A6, A7A7, and A8A8 to generate observed length profile. Among this A7A7 has the highest probability. Therefore, we use this form as the representative for homozygous. -- Then, this tool will calculate the probability of heterozygous A6A7, A7A8, and A6A8 to generate observed length profile. Among this A6A8 has the highest probability. Therefore, we use this form as the representative for heterozygous. -- The A6A7 has higher probability than A7A7. Therefore, the program will report that this locus is a heterozygous locus. :: +- We want to figure out if this locus is a homoozygote or heterozygote and the corresponding allele(s). Therefore, we use this tool to refine genotype. +- This tool will calculate the probability of homozygote A6A6, A7A7, and A8A8 to generate the observed STR length profile. Among this A7A7 has the highest probability. Therefore, we use this form as the representative for homozygote. +- Then, this tool will calculate the probability of heterozygote A6A7, A7A8, and A6A8 to generate the observed STR length profile. Among this A6A8 has the highest probability. Therefore, we use this form as the representative for heterozygote. +- Finally, it will compare the representative homozygous and heterozygous forms. The A6A8 has higher probability than A7A7. Therefore, the program will report that this locus as a heterozygous locus of form A6A8. :: chr1 5,6,6,6,6,7,7,8,8 A hetero -14.8744881854 7 6 8
--- a/PEsortedSAM2readprofile.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/PEsortedSAM2readprofile.xml Wed Apr 22 12:48:40 2015 -0400 @@ -1,5 +1,5 @@ -<tool id="PEsortedSAM2readprofile" name="Combine mapped flaked bases" version="1.0.0"> - <description> from SAM file sorted by readname </description> +<tool id="PEsortedSAM2readprofile" name="Combine mapped faux paired-end reads" version="1.0.0"> + <description> and get the reference STR allele from the reference genome </description> <command interpreter="python2.7">PEsortedSAM2readprofile.py $flankedbasesSAM $twobitref $maxTRlength $maxoriginalreadlength $output </command> <inputs> @@ -30,8 +30,7 @@ **What it does** -- This tool will take SAM file sorted by read name, remove unpaired reads, report microsatellites sequences in the reference genome that correspond to the space between paired end reads. Coordinate of start and stop for left and right flanking regions of microsatellites and microsatellite itself as inferred from paired end reads will also be reported. -- These microsatellites in reference can be used to filter out reads that do not contain microsatellites that concur with microsatellites in reference where the reads mapped to. +- This tool will take SAM file (sorted by read name), remove unpaired reads, and combine paired faux read-pairs into a single row. It also reports Short Tandem Repeats (STRs) sequences in the reference genome that correspond to the space between the faux paired end reads and the coordinate of start and stop for left and right flanking regions of STRs. **Citation** @@ -39,22 +38,22 @@ **Input** -- Sorted SAM files by read name +- Sorted SAM files by read name. **Output** -The output will combined two lines of input which are paired. The output format is as follow. +The output will combine the two faux paired-end read lines of input ito the following single line format: - Column 1 = read name - Column 2 = chromosome - Column 3 = left flanking region start - Column 4 = left flanking region stop -- Column 5 = microsatellite start -- Column 6 = microsatellite stop +- Column 5 = STR start +- Column 6 = STR stop - Column 7 = right flanking region start - Column 8 = right flanking region stop -- Column 9 = microsatellite length in reference -- Column 10= microsatellite sequence in reference +- Column 9 = STR length in reference +- Column 10= STR sequence in reference
--- a/README.md Wed Apr 01 17:06:29 2015 -0400 +++ b/README.md Wed Apr 22 12:48:40 2015 -0400 @@ -1,1 +1,103 @@ -# STR-FM +# *STR-FM*, a short tandem repeat profiling using a flank-based mapping approach + +## User manual and guide +We designed the STR profiling pipeline as a collection of tools which can be executed in both commandline or via a GUI on Galaxy. The easiest way to use STR-FM pipeline is to via Galaxy platform. Current, we have all tools in Galaxy main toolshed (See installation of STR-FM tools from toolshed below) and in Galaxy test website (STR-FM: microsatellite analysis). + +## Overview + +Our tools in ‘str_fm’ can be used to: + +**(1) profile STRs from short read data with STR-FM pipeline** (tools: ‘STR detection’, ‘Read name modifier’, ‘Fetch bases flanking’, ‘Combine mapped faux paired-end reads’, ‘Check STR motif compatibility between reference and read STRs’, ‘Select uninterrupted STRs’) + +This pipeline needs several tools on Galaxy to complete the process. It can be customized with different mapper or STRs detection algorithm. Either single-end or paired-end sequencing data can be utilized; for paired-end read data, each read is treated separately. The core of the pipeline consists of the following three procedures + +First, STR-FM runs a short-read STR detection tool using a string comparison algorithm (see publication details). The algorithm can detect exact (pure, or uninterrupted) STRs (mono- through hexanucleotide STRs greater than or equal to two repeats), incomplete motifs (e.g., ATATATA), interrupted STRs (e.g., AAAATAAAAA), or multiple STRs in a read. Reads that do not have sufficient upstream or downstream sequences flanking the STRs are discarded (we used a threshold of 20 bp on each side of an STR). Each read is split into two “pseudoreads,” containing the upstream and downstream flanks surrounding the STR. + +Second, these are mapped to the reference genome using a standard paired-end read-mapping algorithm, e.g., BWA, Bowtie, or Bowtie2, treating each pair of flanking sequences as a faux paired-end read. + +Finally, STR-FM runs a profiler tool, which groups all reads with STRs that are mapped to the same location in the reference genome. As a result, an array of all STR lengths from the reads mapping to a particular STR-containing locus is generated. + +**(2) genotype STRs with error correction** (tool ‘Correct genotype for STR errors’) + +This pipeline needs only one of our tools to complete process. It will take STR-profile file and sequencine error rates file as inputs. The program will calculate the maximum likelihood of genotype for each STR locus in STR-profile file. Then it will report the mostly likely genotype and the log odds ratio between their probabilities, which can be interpreted as a confidence of genotyping (the more this value deviates from 0, the more confidence we have in this genotype). + +**(3) estimate the minimum informative read depth from error rates** (tools: ‘Generate all possible combination of STR length profile’, ‘Evaluate the probability of the allele combination to generate read profile’, ‘Combine read profile probabilities’) + +This pipeline needs other tools on Galaxy to complete the process. This pipeline will generate all possible read profiles from sequencing error spectrum, select the profiles that can distinguish heterozygote from homozygote, calculate the probability to produce such profiles from sequencing error spectrum, and report the probability that a certain sequence depth can distinguish heterozygote from homozygote under a given sequencing error rates (see publication details). We recommend that you should try to run with less than 10x depth for initial trial. + +**(4) convert informative read depth to locus-specific and genome-wide sequencing depth** (tool ‘Convert informative read depth to sequencing depth’). + +This pipeline needs only one of our tools to complete process. It will convert *informative read depth* to *locus-specific sequencing depth* (given read length) and *genome-wide sequencing depth* (given confidence intervals). + + +## Description of tools + +The short description for each tool is provided below. + +1. “STR detection” = Detect STRs from short reads (FASTQ), reference genome (FASTA), or alignments (SAM) +2. “Read name modifier” = Change space in read name to ‘_’ to prevent read name truncation by mapping tools +3. “Fetch bases flanking” = Generate two FASTQ files containing flanking bases around STRs for mapping as faux paired-end reads +4. “Combine mapped faux paired-end reads” = For each mapped faux paired-end reads, infer STR sequence in reference genome between the two mapped ends of the pair +5. “Check STR motif compatibility between reference and read STRs” = Check if two STRs have the same motif +6. “Select uninterrupted STRs” = Select STRs that do not contain an interruption +7. “Correct genotype for STR errors” = Build error correction model from pre-defined error rates and identify most likely genotype of the input data +8. “Generate all possible combination of STR length profile” = Use STR error spectrum to generate all possible combinations of read profile at each read depth +9. “Evaluate the probability of the allele combination to generate read profile” = Calculate the probability of a given genotype to generate read profiles (instead of finding most likely genotype like tool number 7) +10. “Combine read profile probabilities” = Sum the probability of the given allele combinations to generate read profile at certain read depth +11. “Convert informative read depth to sequencing depth” = Calculate ‘locus-specific’ and ‘genome-wide’ sequencing depth from the given informative read depth +The detailed description for each tool is embedded within the tool. + +## Citing *STR-FM* +Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research + +## Installation of STR-FM tools from toolshed + + +The installation can be done as follows + + +1 Install and set configuration of local Galaxy + +1.1 Download and install Galaxy (https://wiki.galaxyproject.org/Admin/GetGalaxy). Galaxy works on both Unix and Mac OS. + +1.2 From your Galaxy directory, add your E-mail as admin E-mail to the Galaxy configuration file. Depending on the Galaxy version, this file can be either universe_wsgi.ini or config/galaxy.ini (https://wiki.galaxyproject.org/Admin/Interface) + +1.3 Set directory for tool dependencies (step 2 in https://wiki.galaxyproject.org/Admin/Tools/AddToolFromToolShedTutorial). + +1.4 Run local Galaxy from the command line by running ‘sh run.sh’ from your Galaxy directory. + +1.5 Open your Galaxy from your browser at address http://localhost:8080 (https://wiki.galaxyproject.org/Admin/GetGalaxy) + +1.6 Register using your admin E-mail in the ‘User’ tab on the top. + +1.7 Refresh your browser + + +2 Install tools and dependencies + +2.1 From your local galaxy, click ‘Admin’ tab on the top. + +2.2 On the left panel, click ‘Search and browse tool sheds’ under ‘Tool sheds’. ‘Accessible Galaxy tool sheds’ will appear on main panel. + +2.3 Click on ‘Galaxy main tool shed’ and select ‘Browse valid repositories’. (https://wiki.galaxyproject.org/Admin/Tools/AddToolFromToolShedTutorial) + +2.4 Type ‘str_fm in search box and click enter. + +2.5 The ‘suite_str_fm_0_1’ repository that has ‘arkarachai-fungtammasan’ as the owner will appear. The user may click on this repository name and click ‘Preview and install’. The ‘Install to Galaxy’ button will appear on upper right corner. This button allows the user to install all our tools and workflows -- pipelines containing tools for specific purpose such as STR profiling from short read sequencing data, microsatellite detection of the reference genome, and estimating minimum informative read depth. None of our tools have any dependencies. However, some of the other tools that used in our workflows (e.g. SAM flag filter, unique element selection, etc.) are not included in the standard Galaxy installation. For the user’s convenience, we included all dependency tools for the workflows in this repository. Therefore, installing ‘suite_str_fm_0_1’ will be sufficient to operate all workflows we provided. + +2.6 After clicking on ‘Install to Galaxy’ and ‘Install’ button in confirmation page, all our tools, workflows, and test datasets will be downloaded to your local Galaxy. After the download is completed, all our tools will be available on your local Galaxy. If the user wants to use the workflows that we suggested (i.e. STR profiling from short read sequencing data, microsatellite detection of the reference genome, and estimating minimum informative read depth), please proceed to step 3. + +2.7 Refresh your browser + + +3 Install workflows + +3.1 Click on the ‘Admin’ tab at the top again. + +3.2 On the right panel, click ‘Manage installed tool shed repositories’ under ‘Server’. ‘Installed tool shed repositories’ will appear on main panel. + +3.3 Click to open ‘str_fm’ repository. + +3.4 Scroll down to ‘Workflows’ section and select the workflow that you want to install. The SGV graphic of the workflow will appear. + +3.5 Click on the ‘Repository Actions’ on the upper right corner and select ‘Import workflow to Galaxy’. If success, the ‘Workflow <workflow name> imported successfully’ will appear. Once the workflow is imported to your Galaxy, you can view and modify it from ‘Workflow’ tab on the top.
--- a/combineprobforallelecombination.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/combineprobforallelecombination.xml Wed Apr 22 12:48:40 2015 -0400 @@ -1,4 +1,4 @@ -<tool id="combineproballelecom" name="Combine probability to generate read profile " version="2.0.0"> +<tool id="combineproballelecom" name="Combine read profile probabilities " version="2.0.0"> <description>from the same allele combination</description> <command interpreter="python2.7">combinedprobforallelecombination.py $input > $output </command> @@ -24,7 +24,7 @@ **What it does** -- This tool will combine probability that the allele combination can generated any read profile in the input. This is the last step to calculate probability to detect heterozygous for each allele combination and each depth. +- This tool will combine the read profile probabilities for each allele combination in the input and calculates the probability to detect heterozygote for each allele combination and each depth. **Citation** @@ -34,30 +34,30 @@ The input format is the same as output from **Evaluate the probability of the allele combination to generate read profile** tool. -- Column 1 = location of microsatellite locus. -- Column 2 = length profile (length of microsatellite in each read that mapped to this location in comma separated format). -- Column 3 = motif of microsatellite in this locus. The input file can contain more than three column. -- Column 4 = homozygous/heterozygous label. -- Column 5 = log based 10 of (the probability of homozygous/the probability of heterozygous) -- Column 6 = Allele for most probable homozygous form. -- Column 7 = Allele 1 for most probable heterozygous form. -- Column 8 = Allele 2 for most probable heterozygous form. +- Column 1 = location of STR locus. +- Column 2 = length profile (length of STR in each read that mapped to this location in comma separated format). +- Column 3 = motif of STR in this locus. The input file can contain more than three columns. +- Column 4 = homozygote/heterozygote label. +- Column 5 = log based 10 of (the probability of homozygote/the probability of heterozygote) +- Column 6 = Allele for most probable homozygote. +- Column 7 = Allele 1 for most probable heterozygote. +- Column 8 = Allele 2 for most probable heterozygote. - Column 9 = Probability of the allele combination to generate given read profile. - Column 10 = Number of possible rearrangement of given read profile. - Column 11 = Probability of the allele combination to generate read profile with any rearrangement (Product of column 9 and column 10) - Column 12 = Read depth -Only column 2,3,4,7,8,11 were used in calculation. +Only columns 2,3,4,7,8,11 were used in calculation. **Output** -The output will contain the following header and column +The output will contain the following header and columns - Line 1 header: read_depth allele heterozygous_prob motif - Column 1 = read depth - Column 2 = allele combination -- Column 3 = probability to detect heterozygous of that allele combination +- Column 3 = probability to detect heterozygote of that allele combination - Column 4 = motif
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/commandline_sample_STR-FM_estimate_mininum_informative_Read_Depth Wed Apr 22 12:48:40 2015 -0400 @@ -0,0 +1,35 @@ +## This is a sample PBS script for profiling STR from reference genome using STR-FM +## +##requirement +##1 STR error rates (can be downloaded from https://usegalaxy.org/u/guru%40psu.edu/h/error-rates-files) --> errorrate.bymajorallele +## +echo " " +echo " " +echo "Job started on `hostname` at `date`" +cd /working/directory/ +echo ${MOTIF} +echo ${OUTPUT} +echo " " +echo "Generate all possible combination of STR length profile" ## See detail in profilegenerator.xml on https://github.com/Arkarachai/STR-FM +python profilegenerator.py errorrate.bymajorallele ${MOTIF} 30 > ${OUTPUT}.30 + +echo "remove duplicated profiles" +cat ${OUTPUT}.30 | sort | uniq > ${OUTPUT}.30.sort + +echo "genotyping using error correction model" ## See detail in GenotypingSTR.xml on https://github.com/Arkarachai/STR-FM +python GenotypeTRcorrection.py ${OUTPUT}.30.sort errorrate.bymajorallele ${OUTPUT}.30.prob 0.5 + +echo "select only full motif different --> need to replace 4 with motif size (1-6)" +cat ${OUTPUT}.30.prob | grep hetero | awk '(($7-$8)==4) || (($8-$7)==4) {print $0}' > ${OUTPUT}.30.prob.screen + +echo "Evaluate the probability of the allele combination to generate read profile" ## See detail in probvalueforhetero.xml on https://github.com/Arkarachai/STR-FM +python heteroprob.py ${OUTPUT}.30.prob.screen ${INPUT} > ${OUTPUT}.30.bino + +echo "formatting" +cat ${OUTPUT}.30.bino | sort -k 12n,12 -k 6n,6 > ${OUTPUT}.30.bino.sort + +echo "Combine read profile probabilities" ## See detail in combineprobforallelecombination.xml on https://github.com/Arkarachai/STR-FM +python combinedprobforallelecombination.py ${OUTPUT}.30.bino.sort > ${OUTPUT}.30.bino.sort.plot + + +echo "Job end on `hostname` at `date`"
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/commandline_sample_STR-FM_reference_profiling Wed Apr 22 12:48:40 2015 -0400 @@ -0,0 +1,25 @@ +## This is a sample PBS script for profiling STR from reference genome using STR-FM version 1.0.0 (April 20, 2014) +## +##requirement +##1 reference genome in FASTA format --> ${INPUT}.fa +## +echo " " +echo " " +echo "Job started on `hostname` at `date`" +cd /working/directory/ +echo " " +echo " detect STR in reference genome" ## See detail in microsatellite.xml on https://github.com/Arkarachai/STR-FM +python microsatellite.py ${INPUT}.fa --fasta --period=1 --partialmotifs --minlength=4 --prefix=0 --suffix=0 --hamming=0 --multipleruns --flankdisplay=0 --splitbyvalidity >${INPUT}.mono.out +python microsatellite.py ${INPUT}.fa --fasta --period=2 --partialmotifs --minlength=6 --prefix=0 --suffix=0 --hamming=0 --multipleruns --flankdisplay=0 --splitbyvalidity >${INPUT}.di.out +python microsatellite.py ${INPUT}.fa --fasta --period=3 --partialmotifs --minlength=6 --prefix=0 --suffix=0 --hamming=0 --multipleruns --flankdisplay=0 --splitbyvalidity >${INPUT}.tri.out +python microsatellite.py ${INPUT}.fa --fasta --period=4 --partialmotifs --minlength=8 --prefix=0 --suffix=0 --hamming=0 --multipleruns --flankdisplay=0 --splitbyvalidity >${INPUT}.tetra.out + +echo "formatting" +cat ${INPUT}.mono.out | awk 'BEGIN{FS="\t";OFS="\t"};{print $6,$2,$2+$1,$4,$1,length($4) }' > ${INPUT}.mono.TR +cat ${INPUT}.di.out | awk 'BEGIN{FS="\t";OFS="\t"};{print $6,$2,$2+$1,$4,$1,length($4) }' > ${INPUT}.di.TR +cat ${INPUT}.tri.out | awk 'BEGIN{FS="\t";OFS="\t"};{print $6,$2,$2+$1,$4,$1,length($4) }' > ${INPUT}.tri.TR +cat ${INPUT}.tetra.out | awk 'BEGIN{FS="\t";OFS="\t"};{print $6,$2,$2+$1,$4,$1,length($4) }' > ${INPUT}.tetra.TR + + + +echo "Job end on `hostname` at `date`"
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/commandline_sample_STR-FM_shortread_profiling Wed Apr 22 12:48:40 2015 -0400 @@ -0,0 +1,124 @@ +## This is a sample PBS script for profiling STR from short read using STR-FM version 2.0.0 (April 20, 2015) +## +##requirement +##1 fastq input in sangerfq Phred scale --> ${INPUT}.fastq +##2 index of mapping program (bwa, bowtie, etc) +##3 location of all STR in reference genome (use PBS script name "sampleSTR_reference_profiling.txt) --> /path/to/STR/in/reference/genome.TR (you can make 4 separated TR files for 4 types of STRs) +##4 reference genome in FASTA and in 2bit file --> /path/to/2bit/ref.2bit (use utility from UCSC genome browser to create 2bit file version of reference genome) +##5 local Galaxy (available from Galaxy website for Mac and Unix computer) +##6 STR error rates (can be downloaded from https://usegalaxy.org/u/guru%40psu.edu/h/error-rates-files) --> errorrate.bymajorallele +## +echo " " +echo " " +echo "Job started on `hostname` at `date`" +ref=/path/to/reference/sequence/and/bwa/index/ref.fa +export PYTHONPATH=/path/to/galaxy-dist/lib/ +galaxydir=/path/to/galaxy-dist/tools +cd /working/directory/ +echo " " +echo " detect STR in short read" ## See detail in microsatellite.xml on https://github.com/Arkarachai/STR-FM +python microsatellite.py ${INPUT}.fastq --fastq --period=1 --partialmotifs --minlength=5 --prefix=20 --suffix=20 --hamming=0 --multipleruns >${INPUT}.mono.out +python microsatellite.py ${INPUT}.fastq --fastq --period=2 --partialmotifs --minlength=6 --prefix=20 --suffix=20 --hamming=0 --multipleruns >${INPUT}.di.out +python microsatellite.py ${INPUT}.fastq --fastq --period=3 --partialmotifs --minlength=9 --prefix=20 --suffix=20 --hamming=0 --multipleruns >${INPUT}.tri.out +python microsatellite.py ${INPUT}.fastq --fastq --period=4 --partialmotifs --minlength=12 --prefix=20 --suffix=20 --hamming=0 --multipleruns >${INPUT}.tetra.out + +echo "change read name at " ## See detail in space2underscore_readname.xml on https://github.com/Arkarachai/STR-FM +python changespacetounderscore_readname.py ${INPUT}.mono.out ${INPUT}.mono.new 6 +python changespacetounderscore_readname.py ${INPUT}.di.out ${INPUT}.di.new 6 +python changespacetounderscore_readname.py ${INPUT}.tri.out ${INPUT}.tri.new 6 +python changespacetounderscore_readname.py ${INPUT}.tetra.out ${INPUT}.tetra.new 6 + +echo "start fetch flanking at `date`" ## See detail in fetchflank.xml on https://github.com/Arkarachai/STR-FM +python pair_fetch_DNA_ff.py ${INPUT}.mono.new ${INPUT}.mono_ff_L.txt ${INPUT}.mono_ff_R.txt 20 20 +python pair_fetch_DNA_ff.py ${INPUT}.di.new ${INPUT}.di_ff_L.txt ${INPUT}.di_ff_R.txt 20 20 +python pair_fetch_DNA_ff.py ${INPUT}.tri.new ${INPUT}.tri_ff_L.txt ${INPUT}.tri_ff_R.txt 20 20 +python pair_fetch_DNA_ff.py ${INPUT}.tetra.new ${INPUT}.tetra_ff_L.txt ${INPUT}.tetra_ff_R.txt 20 20 + +echo "BWA uniquely mapped no indel no deletion " +bwa aln -n 0 -o 0 ${ref} ${INPUT}.mono_ff_L.txt > ${INPUT}.mono_ff_L.sai +bwa aln -n 0 -o 0 ${ref} ${INPUT}.mono_ff_R.txt > ${INPUT}.mono_ff_R.sai +bwa sampe ${ref} ${INPUT}.mono_ff_L.sai ${INPUT}.mono_ff_R.sai ${INPUT}.mono_ff_L.txt ${INPUT}.mono_ff_R.txt > ${INPUT}.mono.sam +samtools view -Sb -F 12 -q 1 ${INPUT}.mono.sam > ${INPUT}.mono.n.all.bam +bwa aln -n 0 -o 0 ${ref} ${INPUT}.di_ff_L.txt > ${INPUT}.di_ff_L.sai +bwa aln -n 0 -o 0 ${ref} ${INPUT}.di_ff_R.txt > ${INPUT}.di_ff_R.sai +bwa sampe ${ref} ${INPUT}.di_ff_L.sai ${INPUT}.di_ff_R.sai ${INPUT}.di_ff_L.txt ${INPUT}.di_ff_R.txt > ${INPUT}.di.sam +samtools view -Sb -F 12 -q 1 ${INPUT}.di.sam > ${INPUT}.di.n.all.bam +bwa aln -n 0 -o 0 ${ref} ${INPUT}.tri_ff_L.txt > ${INPUT}.tri_ff_L.sai +bwa aln -n 0 -o 0 ${ref} ${INPUT}.tri_ff_R.txt > ${INPUT}.tri_ff_R.sai +bwa sampe ${ref} ${INPUT}.tri_ff_L.sai ${INPUT}.tri_ff_R.sai ${INPUT}.tri_ff_L.txt ${INPUT}.tri_ff_R.txt > ${INPUT}.tri.sam +samtools view -Sb -F 12 -q 1 ${INPUT}.tri.sam > ${INPUT}.tri.n.all.bam +bwa aln -n 0 -o 0 ${ref} ${INPUT}.tetra_ff_L.txt > ${INPUT}.tetra_ff_L.sai +bwa aln -n 0 -o 0 ${ref} ${INPUT}.tetra_ff_R.txt > ${INPUT}.tetra_ff_R.sai +bwa sampe ${ref} ${INPUT}.tetra_ff_L.sai ${INPUT}.tetra_ff_R.sai ${INPUT}.tetra_ff_L.txt ${INPUT}.tetra_ff_R.txt > ${INPUT}.tetra.sam +samtools view -Sb -F 12 -q 1 ${INPUT}.tetra.sam > ${INPUT}.tetra.n.all.bam + +echo "sort result by read name" +samtools sort -n ${INPUT}.mono.n.all.bam ${INPUT}.mono.n.sorted.all +samtools sort -n ${INPUT}.di.n.all.bam ${INPUT}.di.n.sorted.all +samtools sort -n ${INPUT}.tri.n.all.bam ${INPUT}.tri.n.sorted.all +samtools sort -n ${INPUT}.tetra.n.all.bam ${INPUT}.tetra.n.sorted.all +samtools view -h -o ${INPUT}.mono.n.sorted.all.sam ${INPUT}.mono.n.sorted.all.bam +samtools view -h -o ${INPUT}.di.n.sorted.all.sam ${INPUT}.di.n.sorted.all.bam +samtools view -h -o ${INPUT}.tri.n.sorted.all.sam ${INPUT}.tri.n.sorted.all.bam +samtools view -h -o ${INPUT}.tetra.n.sorted.all.sam ${INPUT}.tetra.n.sorted.all.bam + +echo "merge faux paired end reads" ## See detail in PEsortedSAM2readprofile.xml on https://github.com/Arkarachai/STR-FM +python PEsortedSAM2readprofile.py ${INPUT}.mono.n.sorted.all.sam /path/to/2bit/ref.2bit 100 250 ${INPUT}.mono.RF +python PEsortedSAM2readprofile.py ${INPUT}.di.n.sorted.all.sam /path/to/2bit/ref.2bit 100 250 ${INPUT}.mono.RF +python PEsortedSAM2readprofile.py ${INPUT}.tri.n.sorted.all.sam /path/to/2bit/ref.2bit 100 250 ${INPUT}.mono.RF +python PEsortedSAM2readprofile.py ${INPUT}.tetra.n.sorted.all.sam /path/to/2bit/ref.2bit 100 250 ${INPUT}.mono.RF + +echo "join mapped coordinate with STR length using read name" +python ${galaxydir}/filters/join.py ${INPUT}.mono.new ${INPUT}.mono.RF 6 1 ${INPUT}.mono.RF.j "" "" --index_depth=3 --buffer=50000000 --fill_options_file='None' +python ${galaxydir}/filters/join.py ${INPUT}.di.new ${INPUT}.di.RF 6 1 ${INPUT}.mono.RF.j "" "" --index_depth=3 --buffer=50000000 --fill_options_file='None' +python ${galaxydir}/filters/join.py ${INPUT}.tri.new ${INPUT}.tri.RF 6 1 ${INPUT}.mono.RF.j "" "" --index_depth=3 --buffer=50000000 --fill_options_file='None' +python ${galaxydir}/filters/join.py ${INPUT}.tetra.new ${INPUT}.tetra.RF 6 1 ${INPUT}.mono.RF.j "" "" --index_depth=3 --buffer=50000000 --fill_options_file='None' + +echo "join mapped coordinate and STR length with STR location in genome" +python ${galaxydir}/new_operations/gops_join.py /path/to/STR/in/reference/genome.TR ${INPUT}.mono.RF.j ${INPUT}.mono.gop -1 1,2,3,0 -2 10,13,14,0 -m 1 -f +python ${galaxydir}/new_operations/gops_join.py /path/to/STR/in/reference/genome.TR ${INPUT}.di.RF.j ${INPUT}.di.gop -1 1,2,3,0 -2 10,13,14,0 -m 1 -f +python ${galaxydir}/new_operations/gops_join.py /path/to/STR/in/reference/genome.TR ${INPUT}.tri.RF.j ${INPUT}.tri.gop -1 1,2,3,0 -2 10,13,14,0 -m 1 -f +python ${galaxydir}/new_operations/gops_join.py /path/to/STR/in/reference/genome.TR ${INPUT}.tetra.RF.j ${INPUT}.tetra.gop -1 1,2,3,0 -2 10,13,14,0 -m 1 -f + +echo "remove incompatible motif (remove incorrect mapped reads given that there is no STR motif difference from reference genome)" ## See detail in microsatcompat.xml on https://github.com/Arkarachai/STR-FM +python microsatcompat.py ${INPUT}.mono.gop 4 10 > ${INPUT}.mono.fulltable1 +python microsatcompat.py ${INPUT}.di.gop 4 10 > ${INPUT}.di.fulltable1 +python microsatcompat.py ${INPUT}.tri.gop 4 10 > ${INPUT}.tri.fulltable1 +python microsatcompat.py ${INPUT}.tetra.gop 4 10 > ${INPUT}.tetra.fulltable1 + +echo "remove shifting flanking location (remove cases that come from STR interruption or flanking bases are misread as STRs)" +cat ${INPUT}.mono.fulltable1 | awk '($19==$2) && ($20==$3) {print $0}' > ${INPUT}.mono.fulltable2 +cat ${INPUT}.di.fulltable1 | awk '($19==$2) && ($20==$3) {print $0}' > ${INPUT}.di.fulltable2 +cat ${INPUT}.tri.fulltable1 | awk '($19==$2) && ($20==$3) {print $0}' > ${INPUT}.tri.fulltable2 +cat ${INPUT}.tetra.fulltable1 | awk '($19==$2) && ($20==$3) {print $0}' > ${INPUT}.tetra.fulltable2 + +echo "keep only column that are necessary for profiling" +cat ${INPUT}.mono.fulltable2| cut -f 1,2,3,4,5,7 | sort -k 1n,1 -k 2n,2 -k 3n,3 > ${INPUT}.mono.cuttmp0 +cat ${INPUT}.di.fulltable2| cut -f 1,2,3,4,5,7 | sort -k 1n,1 -k 2n,2 -k 3n,3 > ${INPUT}.di.cuttmp0 +cat ${INPUT}.tri.fulltable2| cut -f 1,2,3,4,5,7 | sort -k 1n,1 -k 2n,2 -k 3n,3 > ${INPUT}.tri.cuttmp0 +cat ${INPUT}.tetra.fulltable2| cut -f 1,2,3,4,5,7 | sort -k 1n,1 -k 2n,2 -k 3n,3 > ${INPUT}.tetra.cuttmp0 + +echo "If you multiple analysis by splitting initial fastq, you should merge (cat) all results from the same sample after this step" + +echo "create genomic coordinate column and group by that column" +perl ${galaxydir}/filters/fixedValueColumn.pl ${INPUT}.mono.cuttmp0 ${INPUT}.mono.cuttmp1 "_" "no" +python ${galaxydir}/filters/mergeCols.py ${INPUT}.mono.cuttmp1 ${INPUT}.mono.cuttmp2 1 7 2 7 3 +python ${galaxydir}/stats/grouping.py ${INPUT}.mono.cuttmp3 ${INPUT}.mono.cuttmp2 8 0 'cat 6 0' 'cat_uniq 4 0' +perl ${galaxydir}/filters/fixedValueColumn.pl ${INPUT}.di.cuttmp0 ${INPUT}.di.cuttmp1 "_" "no" +python ${galaxydir}/filters/mergeCols.py ${INPUT}.di.cuttmp1 ${INPUT}.di.cuttmp2 1 7 2 7 3 +python ${galaxydir}/stats/grouping.py ${INPUT}.di.cuttmp3 ${INPUT}.di.cuttmp2 8 0 'cat 6 0' 'cat_uniq 4 0' +perl ${galaxydir}/filters/fixedValueColumn.pl ${INPUT}.tri.cuttmp0 ${INPUT}.tri.cuttmp1 "_" "no" +python ${galaxydir}/filters/mergeCols.py ${INPUT}.tri.cuttmp1 ${INPUT}.tri.cuttmp2 1 7 2 7 3 +python ${galaxydir}/stats/grouping.py ${INPUT}.tri.cuttmp3 ${INPUT}.tri.cuttmp2 8 0 'cat 6 0' 'cat_uniq 4 0' +perl ${galaxydir}/filters/fixedValueColumn.pl ${INPUT}.tetra.cuttmp0 ${INPUT}.tetra.cuttmp1 "_" "no" +python ${galaxydir}/filters/mergeCols.py ${INPUT}.tetra.cuttmp1 ${INPUT}.tetra.cuttmp2 1 7 2 7 3 +python ${galaxydir}/stats/grouping.py ${INPUT}.tetra.cuttmp3 ${INPUT}.tetra.cuttmp2 8 0 'cat 6 0' 'cat_uniq 4 0' + +echo "you may filter for minimum sequencing depth here" + +echo "genotyping using error correction model" ## See detail in GenotypingSTR.xml on https://github.com/Arkarachai/STR-FM +cat ${INPUT}.mono.cuttmp2 ${INPUT}.di.cuttmp2 ${INPUT}.tri.cuttmp2 ${INPUT}.tetra.cuttmp2 > ${INPUT}.step5 +python GenotypeTRcorrection.py ${INPUT}.step5 errorrate.bymajorallele ${INPUT}.step5.result 0.5 +## final output is ${INPUT}.step5.result + +echo "Job end on `hostname` at `date`"
--- a/fetchflank.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/fetchflank.xml Wed Apr 22 12:48:40 2015 -0400 @@ -1,5 +1,5 @@ -<tool id="fetchflank" name="Fetch flanking bases" version="1.0.0"> - <description> of microsatellites and output as two fastq files in forward-forward orientation</description> +<tool id="fetchflank" name="Fetch bases flanking" version="1.0.0"> + <description> the STRs in the reads and output two fastq files in forward-forward orientation</description> <command interpreter="python">pair_fetch_DNA_ff.py $microsat_in_read $Leftflanking $Rightflanking $qualitycutoff $lengthofbasetocheckquality </command> <inputs> @@ -29,7 +29,7 @@ **What it does** -This tool will fetch flanking regions around microsatellites, screen for quality score at microsatellites and adjacent flanking regions, and output two fastq files containing flanking regions in forward-forward direction. +This tool will fetch flanking regions around STRs from the reads output by "STR detection" step, screen for quality score at STRs and adjacent flanking regions, and output two fastq files containing flanking regions in forward-forward direction. - This tool assumes that the quality score is Phred+33, such as Sanger fastq. - Reads that have either left or right flanking regions shorter than the length of flanking regions that require quality screening will be removed. @@ -39,22 +39,20 @@ **Input** -The input files need to be in the same format as output from **microsatellite detection program**. This format contains **length of repeat**, **length of left flanking region**, **length of right flanking region**, **repeat motif**, **hamming (editing) distance**, **read name**, **read sequence**, **read quality score** +The input file needs to be in the same format as output from **STR detection** step. This format contains **length of repeat**, **length of left flanking region**, **length of right flanking region**, **repeat motif**, **hamming (editing) distance**, **read name**, **read sequence**, **read quality score** **Output** -The output will be the two fastq files. The first file contains left flank regions. The second file contains right flanking regions. +The output will be two fastq files. The first file contains left flanking bases. The second file contains right flanking bases. **Example** -- Suppose we detected the microsatellites from short reads :: +- Starting with this test input :: 6 40 54 G 0 SRR345592.75000006 HS2000-192_107:1:63:5822:176818_1_per1_1 TACCCTCCTGTCTTCCCAGACTGATTTCTGTTCCTGCCCTggggggTTCTTGACTCCTCTGAATGGGTACGGGAGTGTGGACCTCAGGGAGGCCCCCTTG GGGGGGGGGGGGGGGGGFGGGGGGGGGFEGGGGGGGGGGG?FFDFGGGGGG?FFFGGGGGDEGGEFFBEFCEEBD@BACB*?=99(/=5'6=4:CCC*AA -- We want to get fastq files of flanking regions around microsatellite with quality score at least 20 on Phred +33 - -- Then the program will report these two fastq files :: +- If we want to get fastq files of flanking regions around the detected STRs with quality score of at least 20, the program will report these two fastq files :: @SRR345592.75000006 HS2000-192_107:1:63:5822:176818_1_per1_1 TACCCTCCTGTCTTCCCAGACTGATTTCTGTTCCTGCCCT
--- a/microsatcompat.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/microsatcompat.xml Wed Apr 22 12:48:40 2015 -0400 @@ -1,4 +1,4 @@ -<tool id="microsatcompat" name="Check microsatellites motif compatibility" version="1.0.0"> +<tool id="microsatcompat" name="Check STR motif compatibility between reference and read STRs" version="1.0.0"> <description> </description> <command interpreter="python">microsatcompat.py $input $column1 $column2 > $output </command> @@ -28,9 +28,9 @@ **What it does** -This tool is used to select only the input lines which have compatible microsatellite motifs between two columns. Compatible here is defined as the microsatellites motif that are complementary or have the same sequence when change starting point of motif. For example, **A** is the same as **T**. Also, **AGG** is the same as **GAG**. +This tool is used to select only those input lines that have compatible STR motifs between the two user-specified columns. Two STR motifs are called compatible if they are either identical, or complementary, or produce the same sequence on rotating the start of the motif. For example, **A** is considered compatible with **A** and its reverse complement **T**. Similarly, **AGG** considered compatible with **AGG**, its reverse complement **TCC**, and their rotations **GGA**, **GAG**, **CCT** and **CTC**. -For TRFM pipeline (profiling microsatellites in short read data), this tool can be used to make sure that the microsatellites in the reads have the same motif as the microsatellites in the reference at the corresponding mapped location. +For STR-FM pipeline (profiling STRs in short read data), this tool can be used to make sure that the STRs in the reads have the compatible motif as the STRs in the reference at the corresponding mapped location. **Citation** @@ -40,32 +40,32 @@ The input files can be any tab delimited file. -If this tool is used in TRFM microsatellite profiling, it should contains: +If this tool is used in STR-FM pipeline for STRs profiling, it should contains: -- Column 1 = microsatellite location in reference chromosome -- Column 2 = microsatellite location in reference start -- Column 3 = microsatellite location in reference stop -- Column 4 = microsatellite location in reference motif -- Column 5 = microsatellite location in reference length -- Column 6 = microsatellite location in reference motif size -- Column 7 = length of microsatellites (bp) -- Column 8 = length of left flanking regions (bp) -- Column 9 = length of right flanking regions (bp) +- Column 1 = STR location in reference chromosome +- Column 2 = STR location in reference start +- Column 3 = STR location in reference stop +- Column 4 = STR location in reference motif +- Column 5 = STR location in reference length +- Column 6 = STR location in reference motif size +- Column 7 = length of STR (bp) +- Column 8 = length of left flanking region (bp) +- Column 9 = length of right flanking region (bp) - Column 10 = repeat motif (bp) - Column 11 = hamming distance - Column 12 = read name -- Column 13 = read sequence with soft masking of microsatellites +- Column 13 = read sequence with soft masking of STR - Column 14 = read quality (the same Phred score scale as input) - Column 15 = read name (The same as column 12) - Column 16 = chromosome - Column 17 = left flanking region start - Column 18 = left flanking region stop -- Column 19 = microsatellite start as infer from pair-end -- Column 20 = microsatellite stop as infer from pair-end +- Column 19 = STR start as infer from pair-end +- Column 20 = STR stop as infer from pair-end - Column 21 = right flanking region start - Column 22 = right flanking region stop -- Column 23 = microsatellite length in reference -- Column 24 = microsatellite sequence in reference +- Column 23 = STR length in reference +- Column 24 = STR sequence in reference **Output**
--- a/microsatellite.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/microsatellite.xml Wed Apr 22 12:48:40 2015 -0400 @@ -1,4 +1,4 @@ -<tool id="microsatellite" name="Microsatellite detection" version="1.0.0"> +<tool id="microsatellite" name="STR detection" version="1.0.0"> <description>for short read, reference, and mapped data</description> <command interpreter="python2.7"> microsatellite.py "${filePath}" @@ -91,6 +91,7 @@ <test> <param name="filePath" value="C_sample_fastq"/> <param name="period" value="1"/> + <param name="inputFileType" value="fastq"/> <param name="partialmotifs" value="true" /> <param name="minlength" value="3" /> <param name="prefix" value="5"/> @@ -108,26 +109,9 @@ **What it does** -We use different algorithms to detect microsatellites depend on hamming distance parameter. -If hamming distance is set to zero, the program will only concern about uninterrupted microsatellites. The process works as follows. - -1) Scanning reads using sliding windows. For a given repeat period ‘k’ (e.g. k=2 for dinucleotide TRs), we compared consecutive k-mer window size sequences, with a step size of k. If a base at a given position matches one k positions earlier it was marked with a plus, if corresponding sites had different bases it was marked with a minus. The first k position is blank. - -2) Since we do not allow mutations in reported TR, consecutive “+” signal sequence means that a k-mer TR is present in this sample. - -3) Report k-mer TRs if the length is larger than a threshold provided by the user. - -If hamming distance is set to integer more than zero, the program will concern both uninterrupted and interrupted microsatellites. The process works as follows: - -(1) Identify intervals that are highly correlated with the interval shifted by ‘k’ (the repeat period). These intervals are called "runs" or "candidates". The allowed level of correlation is 6/7. Depending on whether we want to look for more than one microsat, we either find the longest such run (simple algorithm) or many runs (more complicated algorithm). The following steps are then performed on each run. - -(2) Find the most likely repeat motif in the run. This is done by counting all kmers (of length P) and choosing the most frequent. If that kmer is itself covered by a sub-repeat we discard this run. The idea is that we can ignore a 6-mer like ACGACG because we will find it when we are looking for 3-mers. - -(3) Once we identify the most likely repeat motif, we then modify the interval, adjusting start and end to find the interval that has the fewest mismatches vs. a sequence of the motif repeated (hamming distance). - -(4) At this point we have a valid microsat interval (in the eyes of the program). It is subjected to some filtering stages (hamming distance or too close to an end), and if it satisfies those conditions, it's reported to the user - -For more option, the script to run this program can be downloaded and run with python independently from Galaxy. There are more option for the script mode. Help page is build-in inside the script. +This tool identifies simple as well interrupted STRs. Choosing a hamming distance of zero will return simple STRs. +Choosing a hamming distance of greater than zero will return both simple and interrupted STRs. +The algorithms used to identify simple and interrupted STRs are described oin the manuscript cited below (see TABLE XXXX). **Citation** @@ -142,37 +126,37 @@ For fastq, the output will contain the following columns: -- Column 1 = length of microsatellites (bp) -- Column 2 = length of left flanking regions (bp) -- Column 3 = length of right flanking regions (bp) +- Column 1 = length of STR (bp) +- Column 2 = length of left flanking region (bp) +- Column 3 = length of right flanking region (bp) - Column 4 = repeat motif (bp) - Column 5 = hamming distance - Column 6 = read name -- Column 7 = read sequence with soft masking of microsatellites +- Column 7 = read sequence with soft masking of STR - Column 8 = read quality (the same Phred score scale as input) For fasta, fastq without quality score and sam format, column 8 will be replaced with dot(.). -If the users have mapped file (SAM) and would like to profile microsatellites from premapped data instead of using flank-based mapping approach, they can select SAM format input and specify that they want correspond microsatellites in reference for comparison. The output will be as follow: +If the users have mapped file (SAM) and would like to profile STRs from premapped data instead of using flank-based mapping approach, they can select SAM format input and specify that they want correspond STRs in reference for comparison. The output will be as follow: -- Column 1 = length of microsatellites (bp) -- Column 2 = length of left flanking regions (bp) -- Column 3 = length of right flanking regions (bp) +- Column 1 = length of STR (bp) +- Column 2 = length of left flanking region (bp) +- Column 3 = length of right flanking region (bp) - Column 4 = repeat motif (bp) - Column 5 = hamming distance - Column 6 = read name -- Column 7 = read sequence with soft masking of microsatellites +- Column 7 = read sequence with soft masking of STR - Column 8 = read quality (the same Phred score scale as input) - Column 9 = read name (The same as column 6) - Column 10 = chromosome - Column 11 = left flanking region start - Column 12 = left flanking region stop -- Column 13 = microsatellite start as infer from pair-end -- Column 14 = microsatellite stop as infer from pair-end +- Column 13 = STR start as infer from pair-end +- Column 14 = STR stop as infer from pair-end - Column 15 = right flanking region start - Column 16 = right flanking region stop -- Column 17 = microsatellite length in reference -- Column 18 = microsatellite sequence in reference +- Column 17 = STR length in reference +- Column 18 = STR sequence in reference </help> </tool>
--- a/microsatpurity.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/microsatpurity.xml Wed Apr 22 12:48:40 2015 -0400 @@ -1,4 +1,4 @@ -<tool id="microsatpurity" name="Select uninterrupted microsatellites" version="1.0.0"> +<tool id="microsatpurity" name="Select uninterrupted STRs" version="1.0.0"> <description> of a specific column</description> <command interpreter="python">microsatpurity.py $input $period $column_n > $output </command> @@ -28,11 +28,11 @@ **What it does** -This tool is used to select only the uninterrupted microsatellites. Interrupted microsatellites (e.g. ATATATATAATATAT) or sequences of microsatellites with non-microsatellite parts (e.g. ATATATATATG) will be removed. +This tool is used to select only the uninterrupted STRs/microsatellites. Interrupted STRs (e.g. ATATATATAATATAT) or sequences of STRs with non-STR parts (e.g. ATATATATATG) will be removed. -For TRFM pipeline (profiling microsatellites in short read data), this tool can be used to avoid the cases that flanking bases were misread as microsatellite. Thus, the read profile will only reflect the variation of TR length from expansion/contraction. -For example, suppose that the sequence around microsatellite is AGCGACGaaaaaaGCGATCA. If we observe read with sequence AGCGACGaaaaaaaaaaGCGATCA, we can indicate that this is microsatellite expansion. However, if we observe AGCGACGaaaaaaaCGATCA, this is more like a substitution of G to A. These incidents can be removed with this tool. -You can use the tool **combine mapped flaked bases** to get the microsatellites in reference that correspond to sequence between mapped reads. If the user map these reads around the uninterrupted microsatelites in reference, the corresponding sequences between these pairs should be the uninterrupted microsatellites regardless of expansion/contraction of microsatellites in short read data. However, if the substitution of flanking base or if the fluorescent signal from the previous run make it look like substitution, the corresponding sequences in reference in between the pairs will not be uninterrupted microsatellites. Thus this tool can remove those cases and keep only microsatellite expansion/contraction. +As another application of this tool, specifically for STR-FM pipeline (profiling STRs in short read data), it can be used to avoid the cases where flanking bases were misread as STRs (sequencing errors). Thus, the remaining read profile will only reflect the variation of TR length from expansion/contraction. +For example, suppose that the sequence around an STR in the reference genome is AGCGACGaaaaaaGCGATCA. If we observe a read with sequence AGCGACGaaaaaaaaaaGCGATCA, we can indicate that this is an STR expansion. However, if we observe another read with sequence AGCGACGaaaaaaaCGATCA, this is likely a substitution of G to A. Such incidents can be removed with this tool. +You can use the tool **combine mapped flanking bases** to get the STRs in reference that correspond to sequence between mapped reads. If the user map these reads around the uninterrupted STRs in reference, the corresponding sequences between these pairs should be the uninterrupted STRs regardless of expansion/contraction of STRs in short read data. However, if the substitution of flanking base or if the fluorescent signal from the previous run make it look like substitution, the corresponding sequences in reference in between the pairs will not be uninterrupted STRs. Thus this tool can remove those cases and keep only STR expansion/contraction. **Citation** @@ -43,32 +43,32 @@ The input files can be any tab delimited file. -If this tool is used in TRFM microsatellite profiling, it should contains: +If this tool is used in STR-FM for STRs profiling, it should contains: -- Column 1 = microsatellite location in reference chromosome -- Column 2 = microsatellite location in reference start -- Column 3 = microsatellite location in reference stop -- Column 4 = microsatellite location in reference motif -- Column 5 = microsatellite location in reference length -- Column 6 = microsatellite location in reference motif size -- Column 7 = length of microsatellites (bp) -- Column 8 = length of left flanking regions (bp) -- Column 9 = length of right flanking regions (bp) +- Column 1 = STR location in reference chromosome +- Column 2 = STR location in reference start +- Column 3 = STR location in reference stop +- Column 4 = STR location in reference motif +- Column 5 = STR location in reference length +- Column 6 = STR location in reference motif size +- Column 7 = length of STR (bp) +- Column 8 = length of left flanking region (bp) +- Column 9 = length of right flanking region (bp) - Column 10 = repeat motif (bp) - Column 11 = hamming distance - Column 12 = read name -- Column 13 = read sequence with soft masking of microsatellites +- Column 13 = read sequence with soft masking of STR - Column 14 = read quality (the same Phred score scale as input) - Column 15 = read name (The same as column 12) - Column 16 = chromosome - Column 17 = left flanking region start - Column 18 = left flanking region stop -- Column 19 = microsatellite start as infer from pair-end -- Column 20 = microsatellite stop as infer from pair-end +- Column 19 = STR start as infer from pair-end +- Column 20 = STR stop as infer from pair-end - Column 21 = right flanking region start - Column 22 = right flanking region stop -- Column 23 = microsatellite length in reference -- Column 24 = microsatellite sequence in reference +- Column 23 = STR length in reference +- Column 24 = STR sequence in reference **Output**
--- a/probvalueforhetero.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/probvalueforhetero.xml Wed Apr 22 12:48:40 2015 -0400 @@ -28,8 +28,8 @@ **What it does** -- This tool will calculate the probability that the allele combination can generated the given read profile. This tool is part of the pipeline to estimate minimum read depth. -- The calculation of probability is very similar to the tool **Correct genotype for microsatellite errors**. However, this tool will restrict the calculation to only the allele combination indicated in input. Also, when it encounter allele combination that cannot be generated from error profile, the total probability will be zero instead of using base substitution rate. +- This tool will calculate the probability that the allele combination can generated the given the STR length profile. This tool is part of the pipeline to estimate minimum read depth. +- The calculation of probability is very similar to the tool **Correct genotype for STR errors**. However, this tool will restrict the calculation to only the allele combination indicated in input. Also, when it encounter allele combination that cannot be generated from error profile, the total probability will be zero instead of using base substitution rate. **Citation** @@ -37,25 +37,25 @@ **Input** -The input format is the same as output from **Correct genotype for microsatellite errors** tool. +The input format is the same as output from **Correct genotype for STR errors** tool. -- Column 1 = location of microsatellite locus. -- Column 2 = length profile (length of microsatellite in each read that mapped to this location in comma separated format). -- Column 3 = motif of microsatellite in this locus. The input file can contain more than three column. -- Column 4 = homozygous/heterozygous label. -- Column 5 = log based 10 of (the probability of homozygous/the probability of heterozygous) -- Column 6 = Allele for most probable homozygous form. -- Column 7 = Allele 1 for most probable heterozygous form. -- Column 8 = Allele 2 for most probable heterozygous form. +- Column 1 = location of STR locus. +- Column 2 = length profile (length of STR in each read that mapped to this location in comma separated format). +- Column 3 = motif of STR in this locus. The input file can contain more than three column. +- Column 4 = homozygote/heterozygote label. +- Column 5 = log based 10 of (the probability of homozygote/the probability of heterozygote) +- Column 6 = Allele for most probable homozygote. +- Column 7 = Allele 1 for most probable heterozygote. +- Column 8 = Allele 2 for most probable heterozygote. Only column 2,3,7,8 were used in calculation. **Output** -The output will be contain original eight column from the input. However, it will also add these following columns. +The output will contain the original eight columns from the input and the following additional columns. - Column 9 = Probability of the allele combination to generate given read profile. -- Column 10 = Number of possible rearrangement of given read profile. +- Column 10 = Number of possible rearrangements of the given read profile. - Column 11 = Probability of the allele combination to generate read profile with any rearrangement (Product of column 9 and column 10) - Column 12 = Read depth
--- a/profilegenerator.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/profilegenerator.xml Wed Apr 22 12:48:40 2015 -0400 @@ -1,4 +1,4 @@ -<tool id="Profilegenerator" name="Generate all possible combination of read profile" version="2.0.0"> +<tool id="Profilegenerator" name="Generate all possible combination of STR length profile" version="2.0.0"> <description> of the consecutive allele from given error profile </description> <command interpreter="python2.7">profilegenerator.py $error_profile $MOTIF $Maxdepth $minprob > $output </command> @@ -18,7 +18,7 @@ <param name="error_profile" value="sampleprofilegenerator_in"/> <param name="MOTIF" value="A"/> <param name="Maxdepth" value="3"/> - <param name="minprob" file="0.00000001"/> + <param name="minprob" value="0.00000001"/> <output name="output" file="sampleprofilegenerator_out"/> </test> @@ -30,9 +30,9 @@ **What it does** -This tool will generate all possible combination of observed read profile of the consecutive alleles from given error profile. The range of observed read length can be filtered to contain only those that are frequently occur using "Minimum error rate to be considered" parameter. +This tool will generate all possible combination of observed STR length profiles of the consecutive alleles from given error profile. The range of observed read lengths can be filtered to contain only those that are frequently occur using "Minimum error rate to be considered" parameter. -This problem will collect the lists of valid (pass "Minimum error rate to be considered" threshold) observed length profiles from combination of consecutive allele lengths. The lists that are equivalent or the subset of the other lists will be removed. For each depth and each list, length profile were generated from combination with replacement which compatible with python 2.7. There could be redundant error profiles generated from different lists if more than one combination of allele is generated due to overlap range of observed microsatellite lengths. The user need to remove them which can be done easily using **sort | uniq** command in unix. +This program will collect the lists of valid (pass "Minimum error rate to be considered" threshold) observed length profiles from combination of consecutive allele lengths. The lists that are equivalent or the subset of the other lists will be removed. For each depth and each list, length profile were generated from combination with replacement which compatible with python 2.7. There could be redundant error profiles generated from different lists if more than one combination of allele is generated due to overlap range of observed microsatellite lengths. The user need to remove them which can be done easily using **sort | uniq** command in unix. **Citation** @@ -42,20 +42,21 @@ **Input** - The error profile needs to contain these three columns. -- Column 1 = Correct microsatellite length -- Column 2 = Observed microsatellite length +- Column 1 = Correct STR length +- Column 2 = Observed STR length - Column 3 = Number of observation **Output** -- Column 1 = Place holder for location of microsatellite locus. (just "chr") -- Column 2 = length profile (length of microsatellite in each read that mapped to this location in comma separated format). -- Column 3 = motif of microsatellite in this locus. +- Column 1 = Place holder for location of STR locus. (just "chr") +- Column 2 = length profile (length of STR in each read that mapped to this location in comma separated format). +- Column 3 = motif of STR in this locus. **Example** -- Suppose that we provide the following read profile :: +- Suppose that we provide the following STR length profile :: + true obs. reads 9 9 100000 10 10 91456 10 9 1259 @@ -64,16 +65,15 @@ 11 12 514 -- Using default minimum probability to be consider and motif = A, all observed read lengths are valid. The program will generated lists of observed length profiles from consecutive allele length. :: +- Using the default minimum probability (fraction of reads) of 0.00000001 and motif = A, all observed STR lengths are valid. The program will generated lists of observed length profiles from consecutive allele lengths :: 9:10 = [9,10] 10:11 = [9,10,11,12] -- Lists that are subsets of other lists will be removed. Thus, [9,10] will not be considered. +- Lists that are subsets of other lists will be removed. In this example, [9,10] will not be considered. -- Then the program will generate all combination with replacement for each depth from each list. Using **maximum read depth =3**, we will ge the following output. :: +- The program will then generate all combinations with replacement for each depth from each list. Using **maximum read depth levels =3**, we will get the following output. :: - chr 9,9 A chr 9,10 A chr 9,11 A
--- a/readdepth2sequencingdepth.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/readdepth2sequencingdepth.xml Wed Apr 22 12:48:40 2015 -0400 @@ -32,21 +32,22 @@ **What it does** -This tool is used to convert informative read depth (specified by user) to sequencing depth when the microsatellites is mapped using TRFM pipeline. -The locus specific sequencing depth is the sequencing depth that will make a certain loci have certain read depth based on uniform mapped of read. It is calculated as: :: +This tool is used to convert informative read depth (specified by user) to sequencing depth when the STRs is mapped using STR-FM pipeline. +The locus specific sequencing depth (yrequired) is the sequencing depth that will make an STR locus to have a certain informative read depth based on uniform mapping of reads. It is calculated as follows: :: yrequired = ( X * L ) / (L - (2F+r-1)) -Where X = read depth, L = read length, F = the number of flanked bases required on each flanking regions, r = the expected repeat length of microsatellite of interest. +where X = informative read depth, L = read length, F = the number of flanking bases required on either side, r = the expected repeat length of the STR of interest. The genome wide sequencing depth is the sequencing depth that will make certain percentage of genome (e.g. 90 percent or 95 percent) to have certain locus specific sequencing depth. It's calculated using numerical guessing to find smallest lambda that: :: 0.90 (or other proportion specified by user) < = P(Y=0) + P(Y=1) + …+ P(Y=yrequired-1) - P(Y=y) = (lambda^(y) * e ^(-lambda)) /y! - + where P(Y=y) = (lambda^(y) * e ^(-lambda)) /y! + y = specific level of sequencing depth. Lambda = genome wide sequencing depth - + + Please refer the Methods section of the paper cited below for further details. **Citation**
--- a/space2underscore_readname.xml Wed Apr 01 17:06:29 2015 -0400 +++ b/space2underscore_readname.xml Wed Apr 22 12:48:40 2015 -0400 @@ -1,5 +1,5 @@ <tool id="space2underscore_readname" name="Read name modifier" version="1.0.0"> - <description>--change space to underscore of a specific column</description> + <description>--change space to underscore in the read name column</description> <command interpreter="python">changespacetounderscore_readname.py $input $output $column_n </command> <inputs> @@ -26,7 +26,8 @@ **What it does** -This tool is used to change space to underscore. For TRFM pipeline (profiling microsatellites in short read data), this tool is used to change space in read name to underscore to prevent the downstream tools which might recognize incorrect column number due to space in read name. If the input do not have space in read name, this step can be skipped. +The readname produced by the "STR detection" step may contain spaces instead of underscores, which will cause downstream tools that use space as a column delimiter to fail. This tool will help convert space to underscore. +If your input does not have spaces in readname column, this step can be skipped. **Citation** @@ -36,7 +37,7 @@ The input files can be any tab delimited file. -If this tool is used in TRFM microsatellite profiling, it should be in the same format as output from **microsatellite detection program**. This format contains **length of repeat**, **length of left flanking region**, **length of right flanking region**, **repeat motif**, **hamming (editing) distance**, **read name**, **read sequence**, **read quality score** +If this tool is used in STR-FM for STRs profiling, it should be in the same format as output from **STR detection program**. This format contains **length of repeat**, **length of left flanking region**, **length of right flanking region**, **repeat motif**, **hamming (editing) distance**, **read name**, **read sequence**, **read quality score** **Output**