annotate README.md @ 13:35aedbe548b9 draft

Uploaded
author arkarachai-fungtammasan
date Sun, 24 Jul 2016 17:56:49 -0400
parents d5ed5c2e25c3
children
# *STR-FM*, a short tandem repeat profiling pipeline using a flank-based mapping approach

## User manual and guide
We designed the STR profiling pipeline as a collection of tools that can be run either from the command line or through a GUI on Galaxy. The easiest way to use the STR-FM pipeline is via the Galaxy platform. Currently, all tools are available in the Galaxy main toolshed (see "Installation of STR-FM tools from toolshed" below) and on the Galaxy test website (STR-FM: microsatellite analysis).

## Overview

Our tools in ‘str_fm’ can be used to:

**(1) profile STRs from short read data with the STR-FM pipeline** (tools: ‘STR detection’, ‘Read name modifier’, ‘Fetch bases flanking’, ‘Combine mapped faux paired-end reads’, ‘Check STR motif compatibility between reference and read STRs’, ‘Select uninterrupted STRs’)

This pipeline needs several tools on Galaxy to complete the process. It can be customized with a different mapper or STR detection algorithm. Either single-end or paired-end sequencing data can be used; for paired-end data, each read is treated separately. The core of the pipeline consists of the following three procedures:

First, STR-FM runs a short-read STR detection tool using a string comparison algorithm (see publication details). The algorithm can detect exact (pure, or uninterrupted) STRs (mono- through hexanucleotide STRs with two or more repeats), incomplete motifs (e.g., ATATATA), interrupted STRs (e.g., AAAATAAAAA), or multiple STRs in a read. Reads that do not have sufficient upstream or downstream sequence flanking the STR are discarded (we used a threshold of 20 bp on each side of an STR). Each remaining read is split into two “pseudoreads,” containing the upstream and downstream flanks surrounding the STR.

Second, these pseudoreads are mapped to the reference genome using a standard paired-end read-mapping algorithm, e.g., BWA, Bowtie, or Bowtie2, treating each pair of flanking sequences as a faux paired-end read.

Finally, STR-FM runs a profiler tool, which groups all reads with STRs that are mapped to the same location in the reference genome. As a result, an array of all STR lengths from the reads mapping to a particular STR-containing locus is generated.
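
The first and last procedures can be illustrated with a minimal Python sketch. This is not the STR-FM implementation: it uses a regular expression instead of the string comparison algorithm from the publication, handles only pure STRs of one motif length at a time, and skips the mapping step entirely. The names `split_read`, `build_profile`, `is_primitive`, the `min_repeats=3` default, and the `locus` key are assumptions made for illustration; only the 20 bp flank threshold comes from the text.

```python
import re
from collections import defaultdict

MIN_FLANK = 20  # flank threshold quoted in the text (bp on each side)


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def split_read(read, motif_len=2, min_repeats=3, min_flank=MIN_FLANK):
    """Find the longest pure STR with the given motif length and return
    (upstream, str_seq, downstream), or None if no STR is found or a
    flank is shorter than min_flank."""
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
    best = None
    for m in pattern.finditer(read):
        if is_primitive(m.group(1)) and (
            best is None or len(m.group(0)) > len(best.group(0))
        ):
            best = m
    if best is None:
        return None
    upstream, downstream = read[: best.start()], read[best.end():]
    if len(upstream) < min_flank or len(downstream) < min_flank:
        return None  # discard reads without enough flanking sequence
    return upstream, best.group(0), downstream


def build_profile(mapped_strs):
    """Group STR lengths by locus. Each item is (locus, str_length), where
    locus is whatever key the mapping step assigns (hypothetical here)."""
    profile = defaultdict(list)
    for locus, length in mapped_strs:
        profile[locus].append(length)
    return {locus: sorted(lengths) for locus, lengths in profile.items()}
```

In this sketch the two flanks returned by `split_read` play the role of the pseudoreads, and `build_profile` produces the per-locus array of STR lengths once a mapper has assigned each pair to a locus.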

**(2) genotype STRs with error correction** (tool: ‘Correct genotype for STR errors’)

This pipeline needs only one of our tools to complete the process. It takes an STR-profile file and a sequencing error rates file as inputs. The program calculates the maximum-likelihood genotype for each STR locus in the STR-profile file. It then reports the most likely genotype and the log odds ratio between their probabilities, which can be interpreted as the confidence of genotyping (the more this value deviates from 0, the more confidence we have in this genotype).
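
To make the idea of maximum-likelihood genotyping with a log odds ratio concrete, here is a small Python sketch. It rests on several assumptions that are not taken from the tool: a toy error model with a single `error_rate` (the real tool reads empirical error rates from a file), reads drawn from either allele with equal probability, candidate alleles limited to the lengths seen in the profile, and a log odds ratio computed between the two best-scoring genotypes. The function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def read_prob(obs, allele, error_rate, n_lengths):
    """Toy error model (assumption): a read shows the true allele length with
    probability 1 - error_rate; the rest is spread over the other lengths."""
    if obs == allele:
        return 1.0 - error_rate
    return error_rate / max(1, n_lengths - 1)


def genotype(profile, error_rate=0.05):
    """Return (best_genotype, log_odds) for a list of observed STR lengths.
    log_odds is the natural-log likelihood ratio of the best genotype
    against the runner-up (infinite if there is only one candidate)."""
    lengths = sorted(set(profile))
    scored = []
    for a1, a2 in combinations_with_replacement(lengths, 2):
        # Each read comes from allele a1 or a2 with probability 0.5.
        log_lik = sum(
            math.log(
                0.5 * read_prob(obs, a1, error_rate, len(lengths))
                + 0.5 * read_prob(obs, a2, error_rate, len(lengths))
            )
            for obs in profile
        )
        scored.append((log_lik, (a1, a2)))
    scored.sort(reverse=True)
    best_ll, best = scored[0]
    log_odds = best_ll - scored[1][0] if len(scored) > 1 else math.inf
    return best, log_odds
```

For example, `genotype([10] * 9 + [11])` favors the homozygote `(10, 10)`, while `genotype([10] * 5 + [12] * 5)` favors the heterozygote `(10, 12)`; in both cases a larger log odds value means the call is better separated from the next candidate.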

**(3) estimate the minimum informative read depth from error rates** (tools: ‘Generate all possible combination of STR length profile’, ‘Evaluate the probability of the allele combination to generate read profile’, ‘Combine read profile probabilities’)

This pipeline needs other tools on Galaxy to complete the process. It generates all possible read profiles from the sequencing error spectrum, selects the profiles that can distinguish a heterozygote from a homozygote, calculates the probability of producing such profiles from the sequencing error spectrum, and reports the probability that a given sequencing depth can distinguish a heterozygote from a homozygote under the given sequencing error rates (see publication details). We recommend starting with a depth of less than 10x for an initial trial.
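
The enumerate, select, and sum logic described above can be sketched in Python under strong simplifying assumptions: only two possible STR lengths (A and B), a single symmetric `error_rate` instead of a full error spectrum, and a profile counted as informative when the heterozygote has a higher likelihood than either homozygote. The function names, the `target` threshold, and the `max_depth` cap are hypothetical and not part of the STR-FM tools.

```python
import math


def het_call_probability(depth, error_rate=0.05):
    """Probability that a true heterozygote, sequenced at this informative
    read depth, yields a profile favoring the heterozygote (toy model)."""
    total = 0.0
    for k in range(depth + 1):  # k reads show length A, depth - k show B
        # Log-likelihood of this profile under each candidate genotype.
        ll_het = depth * math.log(0.5)
        ll_hom_a = k * math.log(1 - error_rate) + (depth - k) * math.log(error_rate)
        ll_hom_b = (depth - k) * math.log(1 - error_rate) + k * math.log(error_rate)
        if ll_het > max(ll_hom_a, ll_hom_b):
            # With symmetric errors a heterozygote read shows A with p = 0.5.
            total += math.comb(depth, k) * 0.5 ** depth
    return total


def min_informative_depth(target=0.9, max_depth=10, error_rate=0.05):
    """Smallest depth whose probability reaches target, or None if no depth
    up to max_depth does."""
    for depth in range(1, max_depth + 1):
        if het_call_probability(depth, error_rate) >= target:
            return depth
    return None
```

Capping `max_depth` at a small value mirrors the recommendation to start below 10x, since the number of profiles to enumerate grows quickly once more than two lengths are allowed.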

**(4) convert informative read depth to locus-specific and genome-wide sequencing depth** (tool: ‘Convert informative read depth to sequencing depth’)

This pipeline needs only one of our tools to complete the process. It converts *informative read depth* to *locus-specific sequencing depth* (given read length) and *genome-wide sequencing depth* (given confidence intervals).

## Description of tools

A short description of each tool is provided below.

1. “STR detection” = Detect STRs from short reads (FASTQ), reference genome (FASTA), or alignments (SAM)
2. “Read name modifier” = Change spaces in read names to ‘_’ to prevent read name truncation by mapping tools
3. “Fetch bases flanking” = Generate two FASTQ files containing flanking bases around STRs for mapping as faux paired-end reads
4. “Combine mapped faux paired-end reads” = For each mapped faux paired-end read, infer the STR sequence in the reference genome between the two mapped ends of the pair
5. “Check STR motif compatibility between reference and read STRs” = Check if two STRs have the same motif
6. “Select uninterrupted STRs” = Select STRs that do not contain an interruption
7. “Correct genotype for STR errors” = Build an error correction model from pre-defined error rates and identify the most likely genotype of the input data
8. “Generate all possible combination of STR length profile” = Use the STR error spectrum to generate all possible combinations of read profiles at each read depth
9. “Evaluate the probability of the allele combination to generate read profile” = Calculate the probability that a given genotype generates the read profiles (instead of finding the most likely genotype, as tool number 7 does)
10. “Combine read profile probabilities” = Sum the probabilities that the given allele combinations generate a read profile at a certain read depth
11. “Convert informative read depth to sequencing depth” = Calculate ‘locus-specific’ and ‘genome-wide’ sequencing depth from the given informative read depth

A detailed description of each tool is embedded within the tool.
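
As an illustration of the simplest tool in the list, the following Python sketch mimics what “Read name modifier” is described as doing: replacing spaces in read names with ‘_’. It is not the tool's code. It assumes plain four-line FASTQ records, and the function name `modify_read_names` is hypothetical.

```python
def modify_read_names(fastq_lines):
    """Replace spaces with '_' in the header line of each FASTQ record.
    Assumes every record has exactly four lines (header, sequence, '+',
    qualities), so headers sit at indices 0, 4, 8, ..."""
    out = []
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:
            line = line.replace(" ", "_")
        out.append(line)
    return out
```

Only header lines are touched, so a space that happens to occur in a quality string is left alone.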

## Citing *STR-FM*
Fungtammasan A, Ananda G, Hile SE, Su MS, Sun C, Harris R, Medvedev P, Eckert K, Makova KD. 2015. Accurate Typing of Short Tandem Repeats from Genome-wide Sequencing Data and its Applications. Genome Research.

## Installation of STR-FM tools from toolshed

The installation can be done as follows:

1 Install and configure a local Galaxy

1.1 Download and install Galaxy (https://wiki.galaxyproject.org/Admin/GetGalaxy). Galaxy works on both Unix and Mac OS.

1.2 From your Galaxy directory, add your E-mail address as the admin E-mail in the Galaxy configuration file. Depending on the Galaxy version, this file is either universe_wsgi.ini or config/galaxy.ini (https://wiki.galaxyproject.org/Admin/Interface).

1.3 Set the directory for tool dependencies (step 2 in https://wiki.galaxyproject.org/Admin/Tools/AddToolFromToolShedTutorial).

1.4 Start your local Galaxy by running ‘sh run.sh’ on the command line from your Galaxy directory.

1.5 Open your Galaxy in your browser at http://localhost:8080 (https://wiki.galaxyproject.org/Admin/GetGalaxy).

1.6 Register with your admin E-mail in the ‘User’ tab at the top.

1.7 Refresh your browser.
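
Steps 1.2 and 1.3 amount to setting two options in the Galaxy configuration file. The Python sketch below shows one way to do that on an INI-style file, assuming the options are named `admin_users` and `tool_dependency_dir` and live in an `[app:main]` section, as in the older `universe_wsgi.ini` / `galaxy.ini` layout; check the sample configuration shipped with your Galaxy version before relying on this. The helper name, the sample text, the E-mail address, and the directory are placeholders. Note that `configparser` drops comments when it rewrites a file, so editing the file by hand is often the safer choice.

```python
import configparser
import io

# Minimal stand-in for a Galaxy INI file (placeholder content).
SAMPLE_INI = """[server:main]
port = 8080

[app:main]
"""


def set_admin_and_dependency_dir(ini_text, admin_email, dependency_dir):
    """Return ini_text with admin_users and tool_dependency_dir set in
    the [app:main] section (section and option names are assumptions)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section("app:main"):
        parser.add_section("app:main")
    parser.set("app:main", "admin_users", admin_email)
    parser.set("app:main", "tool_dependency_dir", dependency_dir)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


updated = set_admin_and_dependency_dir(
    SAMPLE_INI, "you@example.org", "../tool_dependencies"
)
```

After the file is updated, Galaxy has to be restarted (step 1.4) for the new settings to take effect.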

2 Install tools and dependencies

2.1 In your local Galaxy, click the ‘Admin’ tab at the top.

2.2 On the left panel, click ‘Search and browse tool sheds’ under ‘Tool sheds’. ‘Accessible Galaxy tool sheds’ will appear on the main panel.

2.3 Click on ‘Galaxy main tool shed’ and select ‘Browse valid repositories’ (https://wiki.galaxyproject.org/Admin/Tools/AddToolFromToolShedTutorial).

2.4 Type ‘str_fm’ in the search box and press Enter.

2.5 The ‘suite_str_fm_0_1’ repository, owned by ‘arkarachai-fungtammasan’, will appear. Click on the repository name and then click ‘Preview and install’. The ‘Install to Galaxy’ button will appear in the upper right corner. This button installs all our tools and workflows (pipelines combining tools for a specific purpose, such as STR profiling from short read sequencing data, microsatellite detection in the reference genome, and estimating the minimum informative read depth). None of our tools has any dependencies. However, some of the other tools used in our workflows (e.g. SAM flag filter, unique element selection) are not included in the standard Galaxy installation. For convenience, we included all tools that the workflows depend on in this repository. Installing ‘suite_str_fm_0_1’ is therefore sufficient to run all the workflows we provide.

2.6 After you click ‘Install to Galaxy’ and then the ‘Install’ button on the confirmation page, all our tools, workflows, and test datasets will be downloaded to your local Galaxy. Once the download is complete, all our tools will be available on your local Galaxy. If you want to use the workflows that we suggest (i.e. STR profiling from short read sequencing data, microsatellite detection in the reference genome, and estimating the minimum informative read depth), please proceed to step 3.

2.7 Refresh your browser.

3 Install workflows

3.1 Click on the ‘Admin’ tab at the top again.

3.2 On the left panel, click ‘Manage installed tool shed repositories’ under ‘Server’. ‘Installed tool shed repositories’ will appear on the main panel.

3.3 Click to open the ‘str_fm’ repository.

3.4 Scroll down to the ‘Workflows’ section and select the workflow that you want to install. The SVG graphic of the workflow will appear.

3.5 Click on ‘Repository Actions’ in the upper right corner and select ‘Import workflow to Galaxy’. On success, the message ‘Workflow <workflow name> imported successfully’ will appear. Once the workflow is imported to your Galaxy, you can view and modify it from the ‘Workflow’ tab at the top.