view microsatpurity.xml @ 1:99ec84eb0bab draft default tip

Uploaded
author arkarachai-fungtammasan
date Wed, 01 Apr 2015 17:00:21 -0400
parents 70f8259b0b30
children
line wrap: on
line source

<tool id="microsatpurity" name="Select uninterrupted microsatellites" version="1.0.0">
  <description> of a specific column</description>
  <command interpreter="python">microsatpurity.py $input $period $column_n > $output </command>

  <inputs>
    <param name="input" type="data" label="Select input" />
    <param name="period" type="integer" label="motif size" value="1"/>
    <param name="column_n" type="integer" value="0" label="Select column that contains microsatellites of interest (0 = last column)" />
  </inputs>
  <outputs>
    <data format="tabular" name="output" />
    
  </outputs>
  <tests>
    <!-- Test data with valid values -->
    <test>
      <param name="input" value="microsatpurity_in.txt"/>
      <param name="period" value="2"/>      
      <param name="column_n" value="0"/>
      <output name="output" file="microsatpurity_out.txt"/>
    </test>
    
  </tests>
  <help>


.. class:: infomark

**What it does**

This tool is used to select only the uninterrupted microsatellites. Interrupted microsatellites (e.g. ATATATATAATATAT) or sequences of microsatellites with non-microsatellite parts (e.g. ATATATATATG) will be removed.

For TRFM pipeline (profiling microsatellites in short read data), this tool can be used to avoid the cases that flanking bases were misread as microsatellite. Thus, the read profile will only reflect the variation of TR length from expansion/contraction.
For example, suppose that the sequence around microsatellite is AGCGACGaaaaaaGCGATCA. If we observe read with sequence AGCGACGaaaaaaaaaaGCGATCA, we can indicate that this is microsatellite expansion. However, if we observe AGCGACGaaaaaaaCGATCA, this is more like a substitution of G to A. These incidents can be removed with this tool.
You can use the tool **combine mapped flaked bases** to get the microsatellites in reference that correspond to sequence between mapped reads. If the user map these reads around the uninterrupted microsatelites in reference, the corresponding sequences between these pairs should be the uninterrupted microsatellites regardless of expansion/contraction of microsatellites in short read data. However, if the substitution of flanking base or if the fluorescent signal from the previous run make it look like substitution, the corresponding sequences in reference in between the pairs will not be uninterrupted microsatellites. Thus this tool can remove those cases and keep only microsatellite expansion/contraction.


**Citation**

When you use this tool, please cite **Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications, Genome Research**
 
**Input**

The input files can be any tab delimited file. 

If this tool is used in TRFM microsatellite profiling, it should contains:

- Column 1 = microsatellite location in reference chromosome
- Column 2 = microsatellite location in reference start
- Column 3 = microsatellite location in reference stop
- Column 4 = microsatellite location in reference motif
- Column 5 = microsatellite location in reference length
- Column 6 = microsatellite location in reference motif size
- Column 7 = length of microsatellites (bp)
- Column 8 = length of left flanking regions (bp)
- Column 9 = length of right flanking regions (bp)
- Column 10 = repeat motif (bp)
- Column 11 = hamming distance 
- Column 12 = read name
- Column 13 = read sequence with soft masking of microsatellites
- Column 14 = read quality (the same Phred score scale as input)
- Column 15 = read name (The same as column 12)
- Column 16 = chromosome 
- Column 17 = left flanking region start
- Column 18 = left flanking region stop
- Column 19 = microsatellite start as infer from pair-end
- Column 20 = microsatellite stop as infer from pair-end
- Column 21 = right flanking region start
- Column 22 = right flanking region stop
- Column 23 = microsatellite length in reference
- Column 24 = microsatellite sequence in reference

**Output**

The same as input format.


</help>
</tool>