comparison rank_pathways.xml @ 28:184d14e4270d
Update to Miller Lab devshed revision 4ede22dd5500
author | Richard Burhans <burhans@bx.psu.edu> |
---|---|
date | Wed, 17 Jul 2013 12:46:46 -0400 |
parents | 8997f2ca8c7a |
children | a631c2f6d913 |
27:8997f2ca8c7a | 28:184d14e4270d |
---|---|
60 | 60 |
61 <help> | 61 <help> |
62 | 62 |
63 **Dataset formats** | 63 **Dataset formats** |
64 | 64 |
65 All of the input and output datasets are in tabular_ format. | 65 The query dataset has a column containing ENSEMBL transcript codes for |
66 The input dataset must have columns with KEGG gene ID and pathways. | 66 the gene set of interest, while the background dataset has one column |
67 [Need to update this, since input columns now depend on the "Rank by" choice.] | 67 with ENSEMBL transcript codes and another with GO terms, for some larger |
68 The output datasets are described below. | 68 universe of genes. |
69 (`Dataset missing?`_) | 69 |
70 All of the input and output datasets are in tabular_ format. The input | |
71 dataset (i.e. query) to rank by "percentage of genes affected" has a | |
72 column containing ENSEMBL transcript codes for the gene set of interest, | |
73 while the background dataset has one column with ENSEMBL transcript | |
74 codes and another with KEGG pathways, for some larger universe of genes. | |
75 The input dataset to rank by "change in length and number of paths" | |
76 must have columns with KEGG gene ID and pathways. The output datasets | |
77 are described below. (`Dataset missing?`_) | |
70 | 78 |
71 .. _tabular: ./static/formatHelp.html#tab | 79 .. _tabular: ./static/formatHelp.html#tab |
72 .. _Dataset missing?: ./static/formatHelp.html | 80 .. _Dataset missing?: ./static/formatHelp.html |
73 | 81 |
74 ----- | 82 ----- |
75 | 83 |
76 **What it does** | 84 **What it does** |
77 | 85 |
78 This tool produces a table ranking the pathways based on the percentage | 86 Given a query set of genes from a larger background dataset, this tool |
79 of genes in an input dataset, out of the total in each pathway | 87 evaluates the over- or under-representation of KEGG pathways in the query |
80 [please clarify w.r.t. query and background datasets]. | 88 set, using the specified statistical test. Alternatively, the tool ranks |
81 Alternatively, the tool ranks the pathways based on the change in | 89 the pathways based on the change in length and number of paths connecting |
82 length and number of paths connecting sources and sinks. This change is | 90 sources and sinks. This change is calculated between graphs representing |
83 calculated between graphs representing pathways with and without excluding | 91 pathways with and without excluding the nodes that represent the genes |
84 the nodes that represent the genes in an input list. Sources are all | 92 in an input list. Sources are all the nodes representing the initial |
85 the nodes representing the initial reactants/products in the pathway. | 93 reactants/products in the pathway. Sinks are all the nodes representing |
86 Sinks are all the nodes representing the final reactants/products in | 94 the final reactants/products in the pathway. |
87 the pathway. | |
88 | 95 |
89 If pathways are ranked by percentage of genes affected, the output contains | 96 If pathways are ranked by percentage of genes affected, the output |
90 a row for each KEGG pathway, with the following columns: | 97 contains a row for each KEGG pathway, with the following columns: |
91 | 98 |
92 1. count: the number of genes in the query set that are in this pathway | 99 1. count: the number of genes in the query set that are in this pathway |
93 2. representation: the percentage of this pathway's genes (from the background dataset) that appear in the query set | 100 2. representation: the percentage of this pathway's genes (from the background dataset) that appear in the query set |
94 3. ranking of this pathway, based on its representation ("1" is highest) | 101 3. ranking of this pathway, based on its representation ("1" is highest) |
95 4. probability of depletion of this pathway in the query dataset | 102 4. probability of depletion of this pathway in the query dataset |
96 5. probability of enrichment of this pathway in the query dataset | 103 5. probability of enrichment of this pathway in the query dataset |
97 6. KEGG pathway | 104 6. name of the pathway |
98 | 105 |
99 If pathways are ranked by change in length and number of paths, the | 106 If pathways are ranked by change in length and number of paths, the |
100 output is a tabular dataset with the following columns: | 107 output is a tabular dataset with the following columns: |
101 | 108 |
102 1. change in the mean length of paths between sources and sinks | 109 1. change in the mean length of paths between sources and sinks |
103 2. mean length of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway does not have sources/sinks, the length is assumed to be infinite (I) | 110 2. mean length of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway does not have sources/sinks, the length is assumed to be infinite (I) |
104 3. mean length of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway does not have sources/sinks, the length is assumed to be infinite (I) | 111 3. mean length of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway does not have sources/sinks, the length is assumed to be infinite (I) |
105 4. rank of the change in the mean length of paths between sources and sinks (from high change to low change) | 112 4. rank of the change in the mean length of paths between sources and sinks (from high change to low change) |
106 5. change in the number of paths between sources and sinks | 113 5. change in the number of paths between sources and sinks |
107 6. number of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway does not have sources/sinks, it is assumed to be a circuit (C) | 114 6. number of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway does not have sources/sinks, it is assumed to be a circuit (C) |
108 7. number of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway does not have sources/sinks, it is assumed to be a circuit (C) | 115 7. number of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway does not have sources/sinks, it is assumed to be a circuit (C) |
109 8. rank of the change in the number of paths between sources and sinks (from high change to low change) | 116 8. rank of the change in the number of paths between sources and sinks (from high change to low change) |
110 9. name of the pathway | 117 9. name of the pathway |
111 | 118 |
112 ----- | 119 ----- |
113 | 120 |
114 **Examples** | 121 **Examples** |
115 | 122 |
116 - input (column 10 for KEGG gene ID, column 12 for KEGG pathways):: | 123 Rank by percentage of genes affected: |
117 | 124 |
125 - input background dataset (column 5 for ENSEMBL transcript, column 12 for KEGG pathways, two-tailed Fisher's exact test for statistic):: | |
126 | |
118 Contig39_chr1_3261104_3261850 414 chr1 3261546 ENSCAFT00000000001 ENSCAFP00000000001 S 667 F 476153 probably damaging cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways | 127 Contig39_chr1_3261104_3261850 414 chr1 3261546 ENSCAFT00000000001 ENSCAFP00000000001 S 667 F 476153 probably damaging cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways |
119 Contig62_chr1_19011969_19012646 265 chr1 19012240 ENSCAFT00000000144 ENSCAFP00000000125 * 161 R 483960 probably damaging N | 128 Contig62_chr1_19011969_19012646 265 chr1 19012240 ENSCAFT00000000144 ENSCAFP00000000125 * 161 R 483960 probably damaging N |
120 etc. | 129 etc. |
121 | |
122 - output ranked by percentage of genes affected [need new sample output with more columns]:: | |
123 | 130 |
124 3 0.25 1 cfa03450=Non-homologous end-joining | 131 - input query dataset (column 5 for ENSEMBL transcript):: |
125 1 0.25 1 cfa00750=Vitamin B6 metabolism | 132 |
126 2 0.2 3 cfa00290=Valine, leucine and isoleucine biosynthesis | 133 Contig12_chr20_101969_112646 265 chr20 9822141 ENSCAFT00000001234 ENSCAFP00000021123 T 101 R 476153 probably damaging |
127 3 0.18 4 cfa00770=Pantothenate and CoA biosynthesis | 134 Contig39_chr1_3261104_3261850 414 chr1 3261546 ENSCAFT00000000001 ENSCAFP00000000001 S 667 F 476153 probably damaging |
128 etc. | 135 etc. |
129 | 136 |
130 - output ranked by change in length and number of paths:: | 137 - output:: |
131 | 138 |
132 3.64 8.44 4.8 2 4 9 5 1 cfa00260=Glycine, serine and threonine metabolism | 139 3 0.20 1 1.0 0.0065 cfa03450=Non-homologous end-joining |
133 7.6 9.6 2 1 3 5 2 2 cfa00240=Pyrimidine metabolism | 140 1 0.067 2 1.0 0.019 cfa00750=Vitamin B6 metabolism |
134 0.05 2.67 2.62 6 1 30 29 3 cfa00982=Drug metabolism - cytochrome P450 | 141 2 0.062 3 1.0 0.021 cfa00290=Valine, leucine and isoleucine biosynthesis |
135 -0.08 8.33 8.41 84 1 30 29 3 cfa00564=Glycerophospholipid metabolism | 142 1 0.037 4 1.0 0.035 cfa00770=Pantothenate and CoA biosynthesis |
136 etc. | 143 etc. |
137 | 144 |
145 Rank by change in length and number of paths: | |
146 | |
147 - input (column 10 for KEGG gene ID, column 12 for KEGG pathways):: | |
148 | |
149 Contig39_chr1_3261104_3261850 414 chr1 3261546 ENSCAFT00000000001 ENSCAFP00000000001 S 667 F 476153 probably damaging cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways | |
150 Contig62_chr1_19011969_19012646 265 chr1 19012240 ENSCAFT00000000144 ENSCAFP00000000125 * 161 R 483960 probably damaging N | |
151 etc. | |
152 | |
153 - output:: | |
154 | |
155 3.64 8.44 4.8 2 4 9 5 1 cfa00260=Glycine, serine and threonine metabolism | |
156 7.6 9.6 2 1 3 5 2 2 cfa00240=Pyrimidine metabolism | |
157 0.05 2.67 2.62 6 1 30 29 3 cfa00982=Drug metabolism - cytochrome P450 | |
158 -0.08 8.33 8.41 84 1 30 29 3 cfa00564=Glycerophospholipid metabolism | |
159 etc. | |
138 </help> | 160 </help> |
139 </tool> | 161 </tool> |
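
The six output columns described in the revised help text for "percentage of genes affected" (count, representation, rank, probability of depletion, probability of enrichment, pathway name) can be illustrated with a small sketch. This is not the tool's own code: the function names `hypergeom_pmf` and `rank_pathways`, the input shapes, the use of one-sided hypergeometric tails for the two probabilities, the tie handling in the ranking, and the choice to count background transcripts without any pathway toward the universe size are all assumptions made for illustration. The tool itself lets the user choose the statistical test.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k) when n genes are drawn from N, of which K belong to the pathway."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def rank_pathways(query, background):
    """Hypothetical helper, not the tool's implementation.

    query:      set of transcript IDs (the gene set of interest)
    background: dict mapping transcript ID -> set of pathway names
    Returns rows of (count, representation, rank, p_depletion,
    p_enrichment, pathway), sorted by representation, highest first.
    """
    N = len(background)                      # assumed universe size
    in_query = {t for t in query if t in background}
    n = len(in_query)

    # Invert the background: pathway -> set of transcripts in it.
    members = {}
    for transcript, pathways in background.items():
        for pathway in pathways:
            members.setdefault(pathway, set()).add(transcript)

    stats = []
    for pathway, genes in members.items():
        K = len(genes)
        k = len(genes & in_query)
        lo, hi = max(0, n - (N - K)), min(K, n)
        # One-sided tails (an assumption): P(X <= k) and P(X >= k).
        p_depletion = sum(hypergeom_pmf(i, N, K, n) for i in range(lo, k + 1))
        p_enrichment = sum(hypergeom_pmf(i, N, K, n) for i in range(k, hi + 1))
        stats.append((k, k / K, p_depletion, p_enrichment, pathway))

    stats.sort(key=lambda s: -s[1])
    rows = []
    for count, rep, p_dep, p_enr, pathway in stats:
        # Rank 1 is the highest representation; ties share a rank (assumed).
        rank = 1 + sum(1 for s in stats if s[1] > rep)
        rows.append((count, rep, rank, p_dep, p_enr, pathway))
    return rows


if __name__ == "__main__":
    # Toy data; the transcript and pathway names are made up.
    background = {
        "t1": {"A", "B"}, "t2": {"A"}, "t3": {"B"}, "t4": {"B"}, "t5": set(),
    }
    for row in rank_pathways({"t1", "t2"}, background):
        print("\t".join(str(field) for field in row))
```

In the real datasets the transcript and pathway columns would first have to be parsed from the tabular input, where pathways appear as period-separated `cfa…=Name` entries and `N` marks a gene with no pathway. That parsing step is omitted here.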