Mercurial > repos > miller-lab > genome_diversity
changeset 21:d6b961721037
Miller Lab Devshed version 4c04e35b18f6
author | Richard Burhans <burhans@bx.psu.edu> |
---|---|
date | Mon, 05 Nov 2012 12:44:17 -0500 |
parents | 8a4b8efbc82c |
children | 95a05c1ef5d5 |
files | add_fst_column.xml average_fst.xml calctfreq.py commits.log dpmix.xml extract_flanking_dna.xml extract_primers.xml find_intervals.xml map_ensembl_transcripts.xml pathway_image.xml rank_pathways.xml select_snps.xml specify_restriction_enzymes.xml |
diffstat | 13 files changed, 358 insertions(+), 117 deletions(-) |
--- a/add_fst_column.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/add_fst_column.xml Mon Nov 05 12:44:17 2012 -0500 @@ -10,11 +10,11 @@ </command> <inputs> - <param name="input" type="data" format="gd_snp" label="SNP table" /> + <param name="input" type="data" format="gd_snp" label="SNP dataset" /> <param name="p1_input" type="data" format="gd_indivs" label="Population 1 individuals" /> <param name="p2_input" type="data" format="gd_indivs" label="Population 2 individuals" /> - <param name="data_source" type="select" format="integer" label="Data source"> + <param name="data_source" type="select" format="integer" label="Frequency metric"> <option value="0" selected="true">sequence coverage</option> <option value="1">estimated genotype</option> </param> @@ -22,20 +22,20 @@ <param name="min_reads" type="integer" min="0" value="0" label="Minimum total read count for a population" /> <param name="min_qual" type="integer" min="0" value="0" label="Minimum individual genotype quality" /> - <param name="retain" type="select" label="Special treatment"> - <option value="0" selected="true">Skip row</option> - <option value="1">Set FST = -1</option> + <param name="retain" type="select" label="If a SNP is below minimum"> + <option value="0" selected="true">skip SNP</option> + <option value="1">set FST = -1</option> </param> - <param name="discard_fixed" type="select" label="Apparently fixed SNPs"> - <option value="0">Retain SNPs that appear fixed in the two populations</option> - <option value="1" selected="true">Delete SNPs that appear fixed in the two populations</option> + <param name="discard_fixed" type="select" label="For SNPs that appear to be fixed across both populations"> + <option value="0">retain</option> + <option value="1" selected="true">delete</option> </param> <param name="biased" type="select" label="FST estimator"> <option value="0" selected="true">Wright's original definition</option> - <option value="1">The Weir-Cockerham estimator</option> - <option 
value="2">The Reich-Patterson estimator</option> + <option value="1">the Weir-Cockerham estimator</option> + <option value="2">the Reich-Patterson estimator</option> </param> </inputs> @@ -61,13 +61,24 @@ <help> +**Dataset formats** + +The input datasets are in gd_snp_ and gd_indivs_ formats. +The output dataset is in gd_snp_ format. (`Dataset missing?`_) + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_indivs: ./static/formatHelp.html#gd_indivs +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** The user specifies a SNP table and two "populations" of individuals, both previously defined using the Galaxy tool to specify individuals from a SNP table. No individual can be in both populations. Other choices are as follows. -Data source. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations. +Frequency metric. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations. -After specifying the data source, the user sets lower bounds on amount of data required at a SNP. For estimating the Fst using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimations based on genotype, the bound is the minimum reported genotype quality per individual. +After specifying the frequency metric, the user sets lower bounds on amount of data required at a SNP. For estimating the Fst using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimations based on genotype, the bound is the minimum reported genotype quality per individual. The user specifies whether the SNPs that violate the lower bound should be ignored or the Fst set to -1. 
@@ -81,15 +92,46 @@ Sewall Wright (1951) The genetical structure of populations. Ann Eugen 15:323-354. -B. S. Weir and C. Clark Cockerham (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. +Weir, B.S. and Cockerham, C. Clark (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. Weir, B.S. 1996. Population substructure. Genetic data analysis II, pp. 161-173. Sinauer Associates, Sundand, MA. David Reich, Kumarasamy Thangaraj, Nick Patterson, Alkes L. Price, and Lalji Singh (2009) Reconstructing Indian population history. Nature 461:489-494, especially Supplement 2. -Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the followoing paper. +Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the following paper. Eva-Maria Willing, Christine Dreyer, Cock van Oosterhout (2012) Estimates of genetic differentiation measured by FST do not necessarily require large sample sizes when using many SNP markers. PLoS One 7:e42649. +----- + +**Example** + +- input, SNP table:: + + #{"column_names":["scaf","pos","A","B","qual","ref","rpos","rnuc","1A","1B","1G","1Q","2A","2B","2G","2Q","3A","3B","3G","3Q","4A","4B","4G","4Q", + #"5A","5B","5G","5Q","6A","6B","6G","6Q","pair","dist","prim","rflp"],"dbkey":"canFam2", + #"individuals":[["PB1",9],["PB2",13],["PB3",17],["PB4",21],["PB6",25],["PB8",29]], + #"pos":2,"rPos":7,"ref":6,"scaffold":1,"species":"bear"} + Contig161_chr1_4641264_4641879 115 C T 73.5 chr1 4641382 C 6 0 2 45 8 0 2 51 15 0 2 72 5 0 2 42 6 0 2 45 10 0 2 57 Y 54 0.323 0 + Contig113_chr5_11052263_11052603 28 C T 38.2 chr5 11052280 C 1 2 1 12 3 2 1 10 5 0 2 42 2 1 2 13 3 0 2 36 8 0 2 51 Y 161 +99. 0 + Contig215_chr5_70946445_70947428 363 T G 28.2 chr5 70946809 C 4 0 2 39 0 5 0 12 9 0 2 54 6 0 2 45 3 3 2 1 9 0 2 54 N 43 0.153 0 + etc. 
+ +- input, Population 1 individuals:: + + 9 PB1 + 13 PB2 + +- input, Population 2 individuals:: + + 17 PB3 + 21 PB4 + +- output (minimum read count of 3, discard fixed):: + + Contig113_chr5_11052263_11052603 28 C T 38.2 chr5 11052280 C 1 2 1 12 3 2 1 10 5 0 2 42 2 1 2 13 3 0 2 36 8 0 2 51 Y 161 +99. 0 0.1636 + Contig215_chr5_70946445_70947428 363 T G 28.2 chr5 70946809 C 4 0 2 39 0 5 0 12 9 0 2 54 6 0 2 45 3 3 2 1 9 0 2 54 N 43 0.153 0 0.3846 + etc. + </help> </tool>
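The example added to the add_fst_column.xml help can be checked by hand. Below is a minimal Python 3 sketch, assuming the "sequence coverage" frequency metric (allele frequencies taken from summed read counts) and an unweighted two-population form of Wright's definition, FST = (HT - HS) / HT. The function name `wright_fst` and its argument layout are illustrative and not taken from the tool's source; the tool's actual implementation may differ in details such as weighting.

```python
def wright_fst(a1, b1, a2, b2):
    """Wright's FST for one biallelic SNP from per-population read counts.

    a1, b1: reads of the first and second allele summed over population 1
    a2, b2: the same for population 2
    Returns None if a population has no reads or the SNP appears fixed
    (total heterozygosity of zero), where FST is undefined.
    """
    if a1 + b1 == 0 or a2 + b2 == 0:
        return None
    p1 = a1 / (a1 + b1)            # first-allele frequency, population 1
    p2 = a2 / (a2 + b2)            # first-allele frequency, population 2
    p_bar = (p1 + p2) / 2          # unweighted mean frequency (assumption)
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_t == 0:
        return None
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    return (h_t - h_s) / h_t


# Read counts summed from the example rows: population 1 = PB1 + PB2,
# population 2 = PB3 + PB4.
print(round(wright_fst(4, 4, 7, 1), 4))   # Contig113: 0.1636
print(round(wright_fst(4, 5, 15, 0), 4))  # Contig215: 0.3846
print(wright_fst(14, 0, 20, 0))           # Contig161: None (fixed, so dropped)
```

Under these assumptions the sketch reproduces the two FST values in the example output, and the first input row (Contig161) is absent from the output because both populations appear fixed for the same allele.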
--- a/average_fst.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/average_fst.xml Mon Nov 05 12:44:17 2012 -0500 @@ -15,14 +15,14 @@ </command> <inputs> - <param name="input" type="data" format="gd_snp" label="SNP table" /> + <param name="input" type="data" format="gd_snp" label="SNP dataset" /> <param name="p1_input" type="data" format="gd_indivs" label="Population 1 individuals" /> <param name="p2_input" type="data" format="gd_indivs" label="Population 2 individuals" /> <conditional name="data_source"> - <param name="ds_choice" type="select" format="integer" label="Data source"> - <option value="0" selected="true">sequence coverage and ..</option> - <option value="1">estimated genotype and ..</option> + <param name="ds_choice" type="select" format="integer" label="Frequency metric"> + <option value="0" selected="true">sequence coverage</option> + <option value="1">estimated genotype</option> </param> <when value="0"> <param name="min_value" type="integer" min="1" value="1" label="Minimum total read count for a population" /> @@ -32,15 +32,15 @@ </when> </conditional> - <param name="discard_fixed" type="select" label="Apparently fixed SNPs"> - <option value="0">Retain SNPs that appear fixed in the two populations</option> - <option value="1" selected="true">Delete SNPs that appear fixed in the two populations</option> + <param name="discard_fixed" type="select" label="For SNPs that appear to be fixed across both populations"> + <option value="0">retain</option> + <option value="1" selected="true">delete</option> </param> <conditional name="use_randomization"> <param name="ur_choice" type="select" format="integer" label="Use randomization"> - <option value="0" selected="true">No</option> - <option value="1">Yes</option> + <option value="0" selected="true">no</option> + <option value="1">yes</option> </param> <when value="0" /> <when value="1"> @@ -69,19 +69,32 @@ <help> +**Dataset formats** + +The input datasets are in gd_snp_ and gd_indivs_ formats. 
+The output dataset is in text_ format. (`Dataset missing?`_) + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_indivs: ./static/formatHelp.html#gd_indivs +.. _text: ./static/formatHelp.html#text +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** The user specifies a SNP table and two "populations" of individuals, both previously defined using the Galaxy tool to specify individuals from a SNP table. No individual can be in both populations. Other choices are as follows. -Data source. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations. +Frequency metric. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations. -After specifying the data source, the user sets lower bounds on amount of data required at a SNP. For estimating the FST using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimations based on genotype, the bound is the minimum reported genotype quality per individual. SMPs not meeting these lower bounds are ignored. +After specifying the frequency metric, the user sets lower bounds on amount of data required at a SNP. For estimating the FST using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimations based on genotype, the bound is the minimum reported genotype quality per individual. SNPs not meeting these lower bounds are ignored. The user specifies whether SNPs where both populations appear to be fixed for the same allele should be retained or discarded. Finally, the user decides whether to use randomizations. 
If so, then the user specifies how many randomly generated population pairs (retaining the numbers of individuals of the originals) to generate, as well as the "population" of additional individuals (not in the first two populations) that can be used in the randomization process. The program prints the following measures of FST for the two populations. + 1. The formulation by Sewall Wright (average over FSTs for all SNPs). 2. The Weir-Cockerham estimator (average over FSTs for all SNPs). 3. The Reich-Patterson estimator (average over FSTs for all SNPs). @@ -93,14 +106,27 @@ Sewall Wright (1951) The genetical structure of populations. Ann Eugen 15:323-354. -B. S. Weir and C. Clark Cockerham (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. +Weir, B.S. and Cockerham, C. Clark (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. Weir, B.S. 1996. Population substructure. Genetic data analysis II, pp. 161-173. Sinauer Associates, Sundand, MA. David Reich, Kumarasamy Thangaraj, Nick Patterson, Alkes L. Price, and Lalji Singh (2009) Reconstructing Indian population history. Nature 461:489-494, especially Supplement 2. -Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the followoing paper. +Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the following paper. Eva-Maria Willing, Christine Dreyer, Cock van Oosterhout (2012) Estimates of genetic differentiation measured by FST do not necessarily require large sample sizes when using many SNP markers. PLoS One 7:e42649. + +----- + +**Example** + +- output:: + + Using 37847 SNPs, we compute: + Average Wright FST is 0.22810. + Average Weir-Cockerham FST is 0.30813. + Average Reich-Patterson FST is 0.31012. + The population-based Reich-Patterson Fst is 0.33625. + </help> </tool>
--- a/calctfreq.py Tue Oct 23 14:38:04 2012 -0400
+++ b/calctfreq.py Mon Nov 05 12:44:17 2012 -0500
@@ -99,7 +99,11 @@
 sKEGGcPthws=dKEGGcPthws.pop(cGen)
 for eachP in sKEGGcPthws:
 if eachP!='N':
- dPthContsTmp[eachP]+=1
+ if eachP in dPthContsTmp:
+ dPthContsTmp[eachP]+=1
+ else:
+ print >> sys.stderr, "Error: pathway not found in database: '{0}'".format(eachP)
+ sys.exit(1)
 cntGens+=1
 #~ Calculate Freqs.
 ltfreqs=[((Decimal(dPthContsTmp[x])/Decimal(dPthContsTotls[x])),Decimal(dPthContsTmp[x]),x) for x in dPthContsTotls]
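The calctfreq.py change replaces an unchecked dictionary increment with a membership test that reports unknown pathways. A minimal Python 3 sketch of the same guard follows (the original file is Python 2). The function name `count_pathway_hits` and its parameters are hypothetical, and raising `KeyError` stands in for the original's message on stderr followed by `sys.exit(1)`.

```python
def count_pathway_hits(gene_pathways, counts):
    """Increment counts[pathway] for each known pathway of one gene.

    gene_pathways: iterable of pathway codes; "N" means no known pathway
    counts: dict mapping every pathway in the database to its running count
    Raises KeyError for a pathway code that is missing from counts.
    """
    for pathway in gene_pathways:
        if pathway == "N":
            continue  # gene has no known pathway; nothing to count
        if pathway not in counts:
            raise KeyError(
                "pathway not found in database: '{0}'".format(pathway))
        counts[pathway] += 1
    return counts


counts = {"cfa00230": 0, "cfa01100": 0}
count_pathway_hits(["cfa00230", "N"], counts)
print(counts)  # {'cfa00230': 1, 'cfa01100': 0}
```

Failing with an explicit message makes a mismatch between the input dataset and the pathway database easier to diagnose than a bare lookup failure.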
--- a/commits.log Tue Oct 23 14:38:04 2012 -0400
+++ b/commits.log Mon Nov 05 12:44:17 2012 -0500
@@ -1,3 +1,7 @@
+
+:f556345a4185
+cathy 2012-11-02 17:45
+Tweaked parameter labels.
 
 :8703e16fca01
 cathy 2012-10-04 11:42
--- a/dpmix.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/dpmix.xml Mon Nov 05 12:44:17 2012 -0500 @@ -10,19 +10,19 @@ </command> <inputs> - <param name="input" type="data" format="gd_snp" label="Dataset"> + <param name="input" type="data" format="gd_snp" label="SNP dataset"> <validator type="unspecified_build" message="This dataset does not have a reference species and cannot be used with this tool" /> </param> <param name="ap1_input" type="data" format="gd_indivs" label="Ancestral population 1 individuals" /> <param name="ap2_input" type="data" format="gd_indivs" label="Ancestral population 2 individuals" /> <param name="p_input" type="data" format="gd_indivs" label="Potentially admixed individuals" /> - <param name="data_source" type="select" format="integer" label="Data source"> + <param name="data_source" type="select" format="integer" label="Similarity metric"> <option value="0" selected="true">sequence coverage</option> <option value="1">estimated genotype</option> </param> - <param name="switch_penalty" type="integer" min="0" value="10" label="Switch penalty" /> + <param name="switch_penalty" type="integer" min="0" value="10" label="Genotype switch penalty" help="Note: typically between 10 and 100."/> </inputs> <outputs> @@ -71,13 +71,13 @@ chromosomes) and a set of potentially admixed individuals, and chooses between the sequence coverage or the estimated genotypes to measure the similarity of genomic intervals in admixed individuals to the two -classes of ancestral chromosomes. The user also picks a "switch penalty", +classes of ancestral chromosomes. The user also picks a "genotype switch penalty", typically between 10 and 100. For each potentially admixed individual, the program divides the genome into three "genotypes": (0) homozygous for the first ancestral population (i.e., both chromosomes from that population), (1) heterozygous, or (2) homozygous for the second ancestral population. 
Parts of a chromosome that are labeled as "heterochromatic" -are given the non-genotype, 3. Smaller values of the switch penalty +are given the non-genotype "3". Smaller values of the switch penalty (corresponding to more ancient admixture events) generally lead to the reconstruction of more frequent changes between genotypes.
--- a/extract_flanking_dna.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/extract_flanking_dna.xml Mon Nov 05 12:44:17 2012 -0500 @@ -12,13 +12,13 @@ </command> <inputs> - <param format="tabular" name="input" type="data" label="Selected SNPS dataset"/> - <param name="output_format" type="select" format="integer" label="output format"> + <param format="tabular" name="input" type="data" label="SNP dataset"/> + <param name="output_format" type="select" format="integer" label="Output format"> <option value="fasta" selected="true">FastA format</option> - <option value="primer3">Primer3 input</option> + <option value="primer3">Boulder-IO (for Primer3)</option> </param> <conditional name="override_metadata"> - <param name="choice" type="select" format="integer" label="choose columns"> + <param name="choice" type="select" format="integer" label="Choose columns" help="Datasets in gd_snp format have the columns in the metadata, all others need the columns chosen." > <option value="0" selected="true">No, get columns from metadata</option> <option value="1" >Yes, choose columns</option> </param> @@ -53,17 +53,31 @@ <help> +**Dataset formats** + +The input dataset is in tabular_ format and must contain a scaffold or +chromosome column and a position column. The output is in fasta_ format or +Boulder-IO_ format used by Primer3. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _fasta: ./static/formatHelp.html#fasta +.. _Boulder-IO: ./static/formatHelp.html#boulder +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** - This tool reports a DNA segment containing each SNP, with up to 200 nucleotides on - either side of the SNP position, which is indicated by "n". Fewer nucleotides - are reported if the SNP is near an end of the assembled genome fragment. +This tool reports a DNA segment containing each SNP, with up to 200 nucleotides +on either side of the SNP position, which is indicated by "n". 
Fewer nucleotides +are reported if the SNP is near an end of the assembled genome fragment. ----- **Example** -- input file:: +- input (gd_snp format):: chr2_75111355_75112576 314 A C L F chr2 75111676 C F 15 4 53 2 9 48 Y 96 0.369 0.355 0.396 0 chr8_93901796_93905612 2471 A C A A chr8 93904264 A A 8 0 51 10 2 14 Y 961 0.016 0.534 0.114 2 @@ -77,7 +91,7 @@ chr19_39866997_39874915 3117 C T P P chr19 39870110 C P 3 7 65 14 2 32 Y 6 0.321 0.911 0.462 4 etc. -- output file:: +- output (FastA format):: > chr2_75111355_75112576 314 A C TATCTTCATTTTTATTATAGACTCTCTGAACCAATTTGCCCTGAGGCAGACTTTTTAAAGTACTGTGTAATGTATGAAGTCCTTCTGCTCAAGCAAATCATTGGCATGAAAACAGTTGCAAACTTATTGTGAGAGAAGAGTCCAAGAGTTTTAACAGTCTGTAAGTATATAGCCTGTGAGTTTGATTTCCTTCTTGTTTTTnTTCCAGAAACATGATCAGGGGCAAGTTCTATTGGATATAGTCTTCAAGCATCTTGATTTGACTGAGCGTGACTATTTTGGTTTGCAGTTGACTGACGATTCCACTGATAACCCAGTAAGTTTAAGCTGTTGTCTTTCATTGTCATTGCAATTTTTCTGTCTTTATACTAGGTCCTTTCTGATTTACATTGTTCACTGATT
--- a/extract_primers.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/extract_primers.xml Mon Nov 05 12:44:17 2012 -0500 @@ -11,9 +11,9 @@ </command> <inputs> - <param format="tabular" name="input" type="data" label="Selected SNPS dataset"/> + <param format="tabular" name="input" type="data" label="SNP dataset"/> <conditional name="override_metadata"> - <param name="choice" type="select" format="integer" label="choose columns"> + <param name="choice" type="select" format="integer" label="Choose columns" help="Datasets in gd_snp format have the columns in the metadata, all others need the columns chosen." > <option value="0" selected="true">No, get columns from metadata</option> <option value="1" >Yes, choose columns</option> </param> @@ -46,30 +46,46 @@ <help> +**Dataset formats** + +The input dataset is in tabular_ format and must contain a scaffold or +chromosome column and a position column. The output dataset is in text_ +format as described below. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _text: ./static/formatHelp.html#text +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** - This tool extracts primers for SNPs in the dataset using the Primer3 program. - The first line of output for a given SNP reports the name of the assembled - contig, the SNP's position in the contig, the two variant nucleotides, and - Primer3's "pair penalty". The next line, if not blank, names restriction - enzymes (from the user-adjustable list) that differentially cut at that - site, but do not cut at any other position between and including the - primer positions. The next lines show the SNP's flanking regions, with - the SNP position indicated by "n", including the primer positions and an - additional 3 nucleotides. +This tool extracts primers for SNPs in the dataset using the Primer3 program +(Steve Rozen and Helen J. Skaletsky, 2000). 
+The first line of output for a given SNP reports the name of the assembled +contig, the SNP's position in the contig, the two variant nucleotides, and +Primer3's "pair penalty". The next line, if not blank, names restriction +enzymes (from the user-adjustable list) that differentially cut at that +site, but do not cut at any other position between and including the +primer positions. The next lines show the SNP's flanking regions, with +the SNP position indicated by "n", including the primer positions and an +additional 3 nucleotides. +<!-- is this precomputed?? how, where is the user-adjustable list? --> ----- **Example** -- input file:: +- input (gd_snp format):: chr5_30800874_30802049 734 G A chr5 30801606 A 24 0 99 4 11 97 Y 496 0.502 0.033 0.215 6 chr8_55117827_55119487 994 A G chr8 55118815 G 25 0 102 4 11 96 Y 22 0.502 0.025 2.365 1 chr9_100484836_100485311 355 C T chr9 100485200 T 27 0 108 6 17 100 Y 190 0.512 0.880 2.733 4 chr12_3635530_3637738 2101 T C chr12 3637630 T 25 0 102 4 13 93 Y 169 0.554 0.024 0.366 4 + etc. -- output file:: +- output:: chr5_30800874_30802049 734 G A 0.352964 BglII,MboI,Sau3AI,Tru9I,XhoII
--- a/find_intervals.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/find_intervals.xml Mon Nov 05 12:44:17 2012 -0500 @@ -22,41 +22,41 @@ </command> <inputs> - <param name="input" type="data" format="tabular" label="Input"> + <param name="input" type="data" format="tabular" label="Dataset"> <validator type="unspecified_build" message="This dataset does not have a reference species and cannot be used with this tool" /> </param> <param name="score_col" type="data_column" data_ref="input" numerical="true" label="Column with score"/> <conditional name="cutoff"> - <param name="type" type="select" label="Cutoff type"> + <param name="type" type="select" label="Score-shift type"> <option value="percentage">percentage</option> <option value="value">value</option> </param> <when value="percentage"> - <param name="cutoff_pct" type="float" value="95" min="0" max="100" label="Percentage cutoff"/> + <param name="cutoff_pct" type="float" value="95" min="0" max="100" label="Percentage score-shift"/> </when> <when value="value"> - <param name="cutoff_val" type="float" value="0.0" label="Value cutoff"/> + <param name="cutoff_val" type="float" value="0.0" label="Value score-shift"/> </when> </conditional> <param name="shuffles" type="integer" min="0" value="0" label="Number of randomizations"/> <param name="out_format" type="select" format="integer" label="Report individual positions"> - <option value="0" selected="true">No</option> - <option value="1">Yes</option> + <option value="0" selected="true">no</option> + <option value="1">yes</option> </param> <conditional name="override_metadata"> - <param name="choice" type="select" format="integer" label="Choose columns" help="Note: you need to choose the columns if the input dataset is not gd_snp"> - <option value="0" selected="true">No, get columns from metadata</option> - <option value="1" >Yes, choose columns</option> + <param name="choice" type="select" format="integer" label="Choose columns" help="Note: you must choose the columns if the 
input dataset is not gd_snp."> + <option value="0" selected="true">no, get columns from metadata</option> + <option value="1" >yes, choose columns here</option> </param> <when value="0" /> <when value="1"> - <param name="ref_col" type="data_column" data_ref="input" numerical="false" label="Column with reference chromosome" help="Note: be sure the build in the metadata is the same as using here."/> - <param name="rpos_col" type="data_column" data_ref="input" numerical="true" label="Column with reference position" help="Note: either zero or one based positions will work"/> + <param name="ref_col" type="data_column" data_ref="input" numerical="false" label="Column with reference chromosome" help="Note: be sure this corresponds to the build recorded in the metadata."/> + <param name="rpos_col" type="data_column" data_ref="input" numerical="true" label="Column with reference position" help="Note: either zero-based or one-based positions will work."/> </when> </conditional> </inputs> @@ -105,14 +105,14 @@ For gd_snp format the metadata can be used to specify the chromosome and position. Other inputs include -a percentage or raw score for the "cutoff" which should be greater than the +a percentage or raw score for the "score-shift" which should be greater than the average value for the scores column. A higher value will give smaller intervals in the output. If a percentage (e.g. 95%) is specified -then that percentile of the scores is used as the cutoff; +then that percentile of the scores is used as the shift; percentile may not work well if many rows or SNPs have the same score (in that case use a raw score). The program subtracts the -cutoff from every score, then finds genomic intervals (i.e., consecutive runs +shift from every score, then finds genomic intervals (i.e., consecutive runs of SNPs) whose total score cannot be increased by adding or subtracting one or more adjusted scores at the ends of the interval. Another input is the number of times the
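The find_intervals.xml help describes subtracting the score-shift from every score and then looking for runs of consecutive SNPs whose total adjusted score cannot be improved by extending or trimming the ends. The Python 3 sketch below illustrates only the core idea and is a deliberate simplification: it uses Kadane's maximum-subarray scan to return the single best run, whereas the tool reports multiple intervals and its actual algorithm is not shown in this changeset. The function name `best_interval` is hypothetical.

```python
def best_interval(scores, shift):
    """Return (start, end, total) of the highest-scoring run of consecutive
    entries after subtracting shift from each score (indices inclusive).

    Simplified illustration using Kadane's algorithm; returns None for an
    empty score list.
    """
    best = None
    cur_sum, cur_start = 0.0, 0
    for i, score in enumerate(scores):
        adjusted = score - shift
        if cur_sum <= 0:
            # A non-positive running total cannot help; restart the run here.
            cur_sum, cur_start = adjusted, i
        else:
            cur_sum += adjusted
        if best is None or cur_sum > best[2]:
            best = (cur_start, i, cur_sum)
    return best


start, end, total = best_interval([0.1, 0.9, 0.8, 0.2, 0.95, 0.1], shift=0.5)
print(start, end)  # 1 4
```

A larger shift makes more adjusted scores negative, which is why the help text notes that a higher value gives smaller intervals.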
--- a/map_ensembl_transcripts.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/map_ensembl_transcripts.xml Mon Nov 05 12:44:17 2012 -0500 @@ -11,8 +11,10 @@ </command> <inputs> - <param name="input" type="data" format="tabular" label="Table" /> - <param name="ensembl_col" type="data_column" data_ref="input" label="Column with ENSEMBL transcript code" /> + <param name="input" type="data" format="tabular" label="Dataset" > + <validator type="unspecified_build" message="This dataset does not have a database/build and cannot be used with this tool" /> + </param> + <param name="ensembl_col" type="data_column" data_ref="input" label="Column with ENSEMBL transcript ID" /> </inputs> <outputs> @@ -34,9 +36,46 @@ <help> +**Dataset formats** + +The input and output datasets are in tabular_ format. +The input dataset must have a column with an ENSEMBL transcript ID and have +the database/build set. Even though positions are not needed the correct +database/build must be given to look up the pathways. +The output dataset will have added columns for the pathway. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** -Adds the fields KEGG gene codes and KEGG pathways to an input table of ENSEMBL transcript codes. +Adds the fields "KEGG gene ID" and "KEGG pathways" to an input table of ENSEMBL +transcript IDs. A "U" in the KEGG gene ID field indicates that the +tool cannot link the ENSEMBL transcript ID to a KEGG gene ID. +An "N" in the pathway field means the KEGG pathway is unknown. + +----- + +**Example** + +- input:: + ENSCAFT00000000001 + ENSCAFT00000000144 + ENSCAFT00000000160 + ENSCAFT00000000215 + etc. 
+ +- output:: + + ENSCAFT00000000001 476153 cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways + ENSCAFT00000000144 483960 N + ENSCAFT00000000160 610160 N + ENSCAFT00000000215 U N + etc. + </help> </tool>
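The "KEGG pathways" column in the example output packs several `code=name` entries into one field, separated by periods, with "N" marking an unknown pathway. A minimal Python 3 sketch for unpacking that field follows. The function name `parse_pathways` is hypothetical, and the pattern assumes pathway codes look like those in the example (a short lowercase organism prefix followed by five digits); that shape is inferred from the sample rows, not from the tool's source.

```python
import re

# Split only on periods that are immediately followed by another
# "code=" entry, so a period inside a pathway name would be preserved.
_ENTRY_BOUNDARY = re.compile(r"\.(?=[a-z]{2,4}\d{5}=)")


def parse_pathways(field):
    """Map pathway code -> pathway name for one "KEGG pathways" field.

    Returns an empty dict when the field is "N" (pathway unknown).
    """
    if field == "N":
        return {}
    pathways = {}
    for entry in _ENTRY_BOUNDARY.split(field):
        code, _, name = entry.partition("=")
        pathways[code] = name
    return pathways


field = "cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism"
print(parse_pathways(field))
# {'cfa00230': 'Purine metabolism', 'cfa00500': 'Starch and sucrose metabolism'}
```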
--- a/pathway_image.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/pathway_image.xml Mon Nov 05 12:44:17 2012 -0500 @@ -6,15 +6,15 @@ "--input=${input}" "--output=${output}" "--KEGGpath=${pathway}" - "--posKEGGclmn=${input.metadata.kegg_path}" - "--KEGGgeneposcolmn=${input.metadata.kegg_gene}" + "--posKEGGclmn=${kpath}" + "--KEGGgeneposcolmn=${kgene}" </command> <inputs> - <param name="input" type="data" format="gd_sap" label="Table"> - <validator type="metadata" check="kegg_gene,kegg_path" message="Missing KEGG gene code column and/or KEGG pathway code/name column metadata. Click the pencil icon in the history item to edit/save the metadata attributes" /> - </param> - <param name="pathway" type="select"> + <param name="input" type="data" format="tabular" label="Dataset" /> + <param name="kgene" type="data_column" data_ref="input" label="Column with KEGG gene ID" /> + <param name="kpath" type="data_column" data_ref="input" numerical="false" label="Column with KEGG pathways" /> + <param name="pathway" label="Pathway" type="select"> <options from_file="gd.pathways.txt"> <column name="value" index="1"/> <column name="name" index="2"/> @@ -30,6 +30,8 @@ <tests> <test> <param name="input" value="test_in/sample.gd_sap" ftype="gd_sap" /> + <param name="kpath" value="10" /> + <param name="kgene" value="12" /> <param name="pathway" value="cfa05214" /> <output name="output" file="test_out/pathway_image/pathway_image.png" compare="sim_size" delta = "10000" /> </test> @@ -37,12 +39,45 @@ <help> +**Dataset formats** + +The input and output datasets are in tabular_ format. +The input dataset must have columns with KEGG gene ID and pathways. +The output dataset is described below. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** -This tool produces an image of an input KEGG pathway, highlighting the -modules representing genes in an input list. 
NOTE: a given gene can +This tool produces an image of a KEGG pathway, highlighting (in red) the +modules representing genes in the input dataset. Click here_ for help +with reading the pathway map. + +NOTE: a given gene can be assigned to multiple modules, and different genes can be assigned to the same module. +.. _here: http://www.genome.jp/kegg/document/help_pathway.html + +----- + +**Example** + +- input:: + + 476153 probably damaging cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways + 483960 probably damaging N + 610160 possibly damaging N + 403657 benign cfa04010=MAPK signaling pathway.cfa04012=ErbB signaling pathway.cfa04060=Cytokine-cytokine receptor interaction.cfa04144=Endocytosis.cfa04510=Focal adhesion.cfa04540=Gap junction.cfa04810=Regulation of actin cytoskeleton.cfa05160=Hepatitis C.cfa05200=Pathways in cancer.cfa05212=Pancreatic cancer.cfa05213=Endometrial cancer.cfa05214=Glioma.cfa05215=Prostate cancer.cfa05218=Melanoma.cfa05219=Bladder cancer.cfa05223=Non-small cell lung cancer + etc. + +output showing pathway cfa05214: + +.. image:: ${static_path}/images/gd_pathway_image.png + </help> </tool>
--- a/rank_pathways.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/rank_pathways.xml Mon Nov 05 12:44:17 2012 -0500 @@ -7,19 +7,19 @@ #else if str($output_format) == 'b' calclenchange.py #end if - "--loc_file=${GALAXY_DATA_INDEX_DIR}/gd.rank.loc" - "--species=${input.metadata.dbkey}" - "--input=${input}" - "--output=${output}" - "--posKEGGclmn=${input.metadata.kegg_path}" - "--KEGGgeneposcolmn=${input.metadata.kegg_gene}" + "--loc_file=${GALAXY_DATA_INDEX_DIR}/gd.rank.loc" + "--species=${input.metadata.dbkey}" + "--input=${input}" + "--output=${output}" + "--posKEGGclmn=${kpath}" + "--KEGGgeneposcolmn=${kgene}" </command> <inputs> - <param name="input" type="data" format="gd_sap" label="Table"> - <validator type="metadata" check="kegg_gene,kegg_path" message="Missing KEGG gene code column and/or KEGG pathway code/name column metadata. Click the pencil icon in the history item to edit/save the metadata attributes" /> - </param> - <param name="output_format" type="select" label="Output format"> + <param name="input" type="data" format="tab" label="Dataset" /> + <param name="kgene" type="data_column" data_ref="input" label="Column with KEGG gene ID" /> + <param name="kpath" type="data_column" data_ref="input" numerical="false" label="Column with KEGG pathways" /> + <param name="output_format" type="select" label="Output"> <option value="a" selected="true">ranked by percentage of genes affected</option> <option value="b">ranked by change in length and number of paths</option> </param> @@ -32,6 +32,8 @@ <tests> <test> <param name="input" value="test_in/sample.gd_sap" ftype="gd_sap" /> + <param name="kgene" value="10" /> + <param name="kpath" value="12" /> <param name="output_format" value="a" /> <output name="output" file="test_out/rank_pathways/rank_pathways.tabular" /> </test> @@ -39,6 +41,18 @@ <help> +**Dataset formats** + +The input and output datasets are in tabular_ format. +The input dataset must have columns with KEGG gene ID and pathways. 
+The output dataset is described below. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** This tool produces a table ranking the pathways based on the percentage @@ -54,23 +68,49 @@ If pathways are ranked by percentage of genes affected, the output is a tabular dataset with the following columns: - 1. number of genes in the pathway present in the input dataset - 2. percentage of the total genes in the pathway included in the input dataset - 3. rank of the frequency (from high freq to low freq) - 4. name of the pathway +1. number of genes in the pathway present in the input dataset +2. percentage of the total genes in the pathway included in the input dataset +3. rank of the frequency (from high freq to low freq) +4. name of the pathway If pathways are ranked by change in length and number of paths, the output is a tabular dataset with the following columns: - 1. change in the mean length of paths between sources and sinks - 2. mean length of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I) - 3. mean length of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I) - 4. rank of the change in the mean length of paths between sources and sinks (from high change to low change) - 5. change in the number of paths between sources and sinks - 6. number of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C) - 7. number of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C) - 8. 
rank of the change in the number of paths between sources and sinks (from high change to low change) - 9. name of the pathway +1. change in the mean length of paths between sources and sinks +2. mean length of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I) +3. mean length of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I) +4. rank of the change in the mean length of paths between sources and sinks (from high change to low change) +5. change in the number of paths between sources and sinks +6. number of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C) +7. number of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C) +8. rank of the change in the number of paths between sources and sinks (from high change to low change) +9. name of the pathway + +----- + +**Examples** + +- input (column 10 for KEGG gene ID, column 12 for KEGG pathways):: + + Contig39_chr1_3261104_3261850 414 chr1 3261546 ENSCAFT00000000001 ENSCAFP00000000001 S 667 F 476153 probably damaging cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways + Contig62_chr1_19011969_19012646 265 chr1 19012240 ENSCAFT00000000144 ENSCAFP00000000125 * 161 R 483960 probably damaging N + etc. 
+ +- output ranked by percentage of genes affected:: + + 3 0.25 1 cfa03450=Non-homologous end-joining + 1 0.25 1 cfa00750=Vitamin B6 metabolism + 2 0.2 3 cfa00290=Valine, leucine and isoleucine biosynthesis + 3 0.18 4 cfa00770=Pantothenate and CoA biosynthesis + etc. + +- output ranked by change in length and number of paths:: + + 3.64 8.44 4.8 2 4 9 5 1 cfa00260=Glycine, serine and threonine metabolism + 7.6 9.6 2 1 3 5 2 2 cfa00240=Pyrimidine metabolism + 0.05 2.67 2.62 6 1 30 29 3 cfa00982=Drug metabolism - cytochrome P450 + -0.08 8.33 8.41 84 1 30 29 3 cfa00564=Glycerophospholipid metabolism + etc. </help> </tool>
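Note on the rank_pathways.xml help text above: in the example input, the KEGG pathways column packs several `code=name` entries into one field, separated by periods, and `N` appears to mark a gene with no pathway annotation. The following is a minimal Python sketch of reading that column and counting input genes per pathway. It is not the code in calctfreq.py. The names `split_pathways` and `count_genes_per_pathway` and the tab delimiter are assumptions made for illustration. The percentage column of the real output also needs the total number of genes in each pathway, presumably from gd.rank.loc, which this sketch does not use.

```python
import re
from collections import defaultdict

# Split only on periods that start a new "code=name" entry, so a period
# inside a pathway name would not break the entry apart.
_ENTRY_BREAK = re.compile(r"\.(?=[a-z]+\d+=)")


def split_pathways(field):
    """Return the list of 'code=name' entries in one KEGG pathways field."""
    field = field.strip()
    if field in ("", "N"):  # assumption: 'N' means no pathway annotation
        return []
    return _ENTRY_BREAK.split(field)


def count_genes_per_pathway(rows, kgene, kpath):
    """Count distinct gene IDs per pathway.

    rows  -- iterable of tab-separated lines (assumed delimiter)
    kgene -- 1-based column holding the KEGG gene ID
    kpath -- 1-based column holding the KEGG pathways
    """
    genes = defaultdict(set)
    for line in rows:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < max(kgene, kpath):
            continue  # skip short or malformed rows
        for entry in split_pathways(cols[kpath - 1]):
            genes[entry].add(cols[kgene - 1])
    return {pathway: len(ids) for pathway, ids in genes.items()}
```

With the test parameters from the diff, the call would be `count_genes_per_pathway(open(path), kgene=10, kpath=12)`, where `path` is a hypothetical file name for the input dataset.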
--- a/select_snps.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/select_snps.xml Mon Nov 05 12:44:17 2012 -0500 @@ -11,12 +11,12 @@ </command> <inputs> - <param format="tabular" name="input" type="data" label="Selected SNPS dataset"> + <param format="tabular" name="input" type="data" label="SNP dataset"> <validator type="unspecified_build" message="This dataset does not have a reference species and cannot be used with this tool" /> </param> <param name="num_snps" type="integer" value="10" optional="false" min="1" label="Number of SNPs"/> <conditional name="override_metadata"> - <param name="choice" type="select" format="integer" label="choose columns"> + <param name="choice" type="select" format="integer" label="Choose columns" help="Datasets in gd_snp format have the column information in the metadata, all others must be chosen." > <option value="0" selected="true">No, get columns from metadata</option> <option value="1" >Yes, choose columns</option> </param> @@ -50,17 +50,27 @@ <help> +**Dataset formats** + +The input and output datasets are in tabular_ format. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** - This tool attempts to select a specified number of SNPs from the dataset, making them - approximately uniformly spaced relative to the reference genome. The number - actually selected may be slightly more than the specified number. +This tool attempts to select a specified number of SNPs from the dataset, making +them approximately uniformly spaced relative to the reference genome. The number +actually selected may be slightly more than the specified number. 
----- **Example** -- input file:: +- input (gd_snp format):: chr2_75111355_75112576 314 A C L F chr2 75111676 C F 15 4 53 2 9 48 Y 96 0.369 0.355 0.396 0 chr8_93901796_93905612 2471 A C A A chr8 93904264 A A 8 0 51 10 2 14 Y 961 0.016 0.534 0.114 2 @@ -74,7 +84,7 @@ chr19_39866997_39874915 3117 C T P P chr19 39870110 C P 3 7 65 14 2 32 Y 6 0.321 0.911 0.462 4 etc. -- output file:: +- output:: chr2_75111355_75112576 314 A C L F chr2 75111676 C F 15 4 53 2 9 48 Y 96 0.369 0.355 0.396 0 chr8_93901796_93905612 2471 A C A A chr8 93904264 A A 8 0 51 10 2 14 Y 961 0.016 0.534 0.114 2
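Note on the select_snps.xml help text above: the changeset does not show how the tool spaces its selections, so the following Python sketch only illustrates the general idea of "approximately uniformly spaced" on a single chromosome. It assumes integer positions, divides the covered span into `num_snps` equal windows, and takes the SNP nearest each window centre. The function name `pick_evenly_spaced` is hypothetical. Unlike the tool, which may return slightly more SNPs than requested, this sketch returns at most `num_snps`.

```python
def pick_evenly_spaced(positions, num_snps):
    """Pick up to num_snps positions spread evenly over one chromosome.

    Illustrative only; this is not the algorithm used by select_snps.
    """
    if num_snps < 1:
        raise ValueError("num_snps must be at least 1")
    ordered = sorted(set(positions))
    if len(ordered) <= num_snps:
        return ordered  # nothing to thin out
    lo, hi = ordered[0], ordered[-1]
    width = (hi - lo) / num_snps
    chosen = []
    for i in range(num_snps):
        target = lo + (i + 0.5) * width  # centre of the i-th window
        nearest = min(ordered, key=lambda p: abs(p - target))
        if nearest not in chosen:  # one SNP can be nearest to two centres
            chosen.append(nearest)
    return chosen
```

A real implementation would also have to handle multiple chromosomes or scaffolds, for example by sharing the requested count among them in proportion to their length.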
--- a/specify_restriction_enzymes.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/specify_restriction_enzymes.xml Mon Nov 05 12:44:17 2012 -0500 @@ -12,9 +12,9 @@ </command> <inputs> - <param format="tabular" name="input" type="data" label="Selected SNPS dataset"/> + <param format="tabular" name="input" type="data" label="SNP dataset"/> <conditional name="override_metadata"> - <param name="choice" type="select" format="integer" label="choose columns"> + <param name="choice" type="select" format="integer" label="Choose columns" help="Datasets in gd_snp format have the columns in the metadata, all others need the columns chosen." > <option value="0" selected="true">No, get columns from metadata</option> <option value="1" >Yes, choose columns</option> </param> @@ -54,17 +54,28 @@ <help> +**Dataset formats** + +The input and output datasets are in tabular_ format. +The input dataset must contain columns for scaffold or chromosome and position. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** - It selects the SNPs that are differentially cut by at least one of the - specified restriction enzymes. The enzymes are required to cut the amplified - segment (for the specified PCR primers) only at the SNP. +It selects the SNPs that are differentially cut by at least one of the +specified restriction enzymes. The enzymes are required to cut the amplified +segment (for the specified PCR primers) only at the SNP. ----- **Example** -- input file:: +- input (gd_snp format):: chr2_75111355_75112576 314 A C L F chr2 75111676 C F 15 4 53 2 9 48 Y 96 0.369 0.355 0.396 0 chr8_93901796_93905612 2471 A C A A chr8 93904264 A A 8 0 51 10 2 14 Y 961 0.016 0.534 0.114 2 @@ -78,7 +89,7 @@ chr19_39866997_39874915 3117 C T P P chr19 39870110 C P 3 7 65 14 2 32 Y 6 0.321 0.911 0.462 4 etc. 
-- output file:: +- output:: chr8_93901796_93905612 2471 A C A A chr8 93904264 A A 8 0 51 10 2 14 Y 961 0.016 0.534 0.114 2 chr14_80021455_80022064 138 G A H H chr14 80021593 G H 14 0 69 9 6 124 Y 377 0.118 0.997 0.195 1