comparison GEMBASSY-1.0.3/doc/text/genret.txt @ 0:8300eb051bea draft

Initial upload
author ktnyt
date Fri, 26 Jun 2015 05:19:29 -0400
-1:000000000000 0:8300eb051bea
1 genret
2 Function
3
4 Retrieves various gene related information from genome flatfile
5
6 Description
7
8 genret reads in one or more genome flatfiles and retrieves various data from
9 the input file. It is a wrapper program for the G-language REST service,
10 where a method is specified by giving a string to the "access" qualifier. By
11 default, genret parses the input file to retrieve the accession ID (or
12 name) of the genome and uses it to query the G-language REST service. When
13 the "accid" qualifier is set to false (or 0), genret instead parses the
14 sequence and features of the genome to create a GenBank-formatted flatfile,
15 uploads the file to the G-language web server, and executes the given
16 method on the uploaded file.
17
18 genret is able to perform a variety of tasks, including the retrieval of
19 sequences upstream, downstream, or around the start or stop codon,
20 translated gene sequences, search of gene data by keyword, and re-annotation
21 and retrieval of genome flatfiles. The set of genes can be given as flat
22 text, a regular expression, or a file containing the list of genes.
23
24 Details on the G-language REST service are available from the wiki page
25
26 http://www.g-language.org/wiki/rest
27
28 Documentation on G-language Genome Analysis Environment methods is
29 provided at the Document Center
30
31 http://ws.g-language.org/gdoc/
32
33 Usage
34
35 Here is a sample session with genret
36
37 Retrieving sequences upstream, downstream, or around the start/stop codons.
38 The following example shows the retrieval of sequence around the start
39 codons of all genes.
40
41 Genes to access are specified by regular expression. '*' stands for every
42 gene.
43
44 Available methods are:
45 after_startcodon
46 after_stopcodon
47 around_startcodon
48 around_stopcodon
49 before_startcodon
50 before_stopcodon
51
52 % genret
53 Retrieves various gene related information from genome flatfile
54 Input nucleotide sequence(s): refseqn:NC_000913
55 Gene name(s) to lookup [*]:
56 Feature to access: around_startcodon
57 Full text output file [nc_000913.around_startcodon]:
58
59 Go to the input files for this example
60 Go to the output files for this example
61
62 Example 2
63
64 Using flat text as target genes. The names can be separated with a space,
65 comma, or vertical bar.
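
The separator rule above can be illustrated with a minimal Python sketch. It only shows how a script might pre-split such a list before handing it to genret; the function name split_gene_names is hypothetical, and this is not a description of how genret itself parses the string.

```python
import re

def split_gene_names(spec):
    """Split a flat-text gene list on spaces, commas, or vertical bars.

    Hypothetical helper for illustration; genret's own parsing may differ.
    """
    # A run of any of the three documented separators counts as one delimiter.
    return [name for name in re.split(r"[ ,|]+", spec.strip()) if name]

print(split_gene_names("recA,recB"))
print(split_gene_names("thrA thrB|thrC"))
```

Treating a run of separators as a single delimiter means that "recA, recB" and "recA,recB" give the same list.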
66
67 % genret
68 Retrieves various gene related information from genome flatfile
69 Input nucleotide sequence(s): refseqn:NC_000913
70 List of gene name(s) to report [*]: recA,recB
71 Name of gene feature to access: translation
72 Sequence output file [nc_000913.translation.genret]: stdout
73 >recA
74 MAIDENKQKALAAALGQIEKQFGKGSIMRLGEDRSMDVETISTGSLSLDIALGAGGLPMGR
75 IVEIYGPESSGKTTLTLQVIAAAQREGKTCAFIDAEHALDPIYARKLGVDIDNLLCSQPDT
76 GEQALEICDALARSGAVDVIVVDSVAALTPKAEIEGEIGDSHMGLAARMMSQAMRKLAGNL
77 KQSNTLLIFINQIRMKIGVMFGNPETTTGGNALKFYASVRLDIRRIGAVKEGENVVGSETR
78 VKVVKNKIAAPFKQAEFQILYGEGINFYGELVDLGVKEKLIEKAGAWYSYKGEKIGQGKAN
79 ATAWLKDNPETAKEIEKKVRELLLSNPNSTPDFSVDDSEGVAETNEDF
80 >recB
81 MSDVAETLDPLRLPLQGERLIEASAGTGKTFTIAALYLRLLLGLGGSAAFPRPLTVEELLV
82 VTFTEAATAELRGRIRSNIHELRIACLRETTDNPLYERLLEEIDDKAQAAQWLLLAERQMD
83 EAAVFTIHGFCQRMLNLNAFESGMLFEQQLIEDESLLRYQACADFWRRHCYPLPREIAQVV
84 FETWKGPQALLRDINRYLQGEAPVIKAPPPDDETLASRHAQIVARIDTVKQQWRDAVGELD
85 ALIESSGIDRRKFNRSNQAKWIDKISAWAEEETNSYQLPESLEKFSQRFLEDRTKAGGETP
86 RHPLFEAIDQLLAEPLSIRDLVITRALAEIRETVAREKRRRGELGFDDMLSRLDSALRSES
87 GEVLAAAIRTRFPVAMIDEFQDTDPQQYRIFRRIWHHQPETALLLIGDPKQAIYAFRGADI
88 FTYMKARSEVHAHYTLDTNWRSAPGMVNSVNKLFSQTDDAFMFREIPFIPVKSAGKNQALR
89 FVFKGETQPAMKMWLMEGESCGVGDYQSTMAQVCAAQIRDWLQAGQRGEALLMNGDDARPV
90 RASDISVLVRSRQEAAQVRDALTLLEIPSVYLSNRDSVFETLEAQEMLWLLQAVMTPEREN
91 TLRSALATSMMGLNALDIETLNNDEHAWDVVVEEFDGYRQIWRKRGVMPMLRALMSARNIA
92 ENLLATAGGERRLTDILHISELLQEAGTQLESEHALVRWLSQHILEPDSNASSQQMRLESD
93 KHLVQIVTIHKSKGLEYPLVWLPFITNFRVQEQAFYHDRHSFEAVLDLNAAPESVDLAEAE
94 RLAEDLRLLYVALTRSVWHCSLGVAPLVRRRGDKKGDTDVHQSALGRLLQKGEPQDAAGLR
95 TCIEALCDDDIAWQTAQTGDNQPWQVNDVSTAELNAKTLQRLPGDNWRVTSYSGLQQRGHG
96 IAQDLMPRLDVDAAGVASVVEEPTLTPHQFPRGASPGTFLHSLFEDLDFTQPVDPNWVREK
97 LELGGFESQWEPVLTEWITAVLQAPLNETGVSLSQLSARNKQVEMEFYLPISEPLIASQLD
98 TLIRQFDPLSAGCPPLEFMQVRGMLKGFIDLVFRHEGRYYLLDYKSNWLGEDSSAYTQQAM
99 AAAMQAHRYDLQYQLYTLALHRYLRHRIADYDYEHHFGGVIYLFLRGVDKEHPQQGIYTTR
100 PNAGLIALMDEMFAGMTLEEA
101
102 Example 3
103
104 Using a file with a list of gene names.
105 The following example will retrieve the strand direction for each gene
106 listed in the "gene_list.txt" file. Strings prefixed with "@" or "list::"
107 are interpreted as file names.
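
As a rough illustration of the file-name convention, the following Python sketch recognizes the "@" and "list::" prefixes and reads a one-name-per-line list such as gene_list.txt. Both function names are hypothetical, and the one-name-per-line layout is assumed from the sample input file in this document.

```python
def gene_list_path(spec):
    """Return the file name if spec carries the '@' or 'list::' prefix, else None."""
    for prefix in ("@", "list::"):
        if spec.startswith(prefix):
            return spec[len(prefix):]
    return None

def read_gene_list(text):
    """Return gene names from list-file text, one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

print(gene_list_path("@gene_list.txt"))
print(read_gene_list("thrA\nthrB\nthrC\n"))
```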
108
109 % genret
110 Retrieves various gene features from genome flatfile
111 Input nucleotide sequence(s): refseqn:NC_000913
112 List of gene name(s) to report [*]: @gene_list.txt
113 Name of gene feature to access: direction
114 Full text output file [nc_000913.direction]: stdout
115 gene,direction
116 thrA,direct
117 thrB,direct
118 thrC,direct
119
120 Go to the input files for this example
121 Go to the output files for this example
122
123 Example 4
124
125 Retrieving translations of coding sequences.
126 The following example will retrieve the translated protein sequence of
127 the "recA" gene.
128
129 % genret
130 Retrieves various gene related information from genome flatfile
131 Input nucleotide sequence(s): refseqn:NC_000913
132 Gene name(s) to lookup [*]: recA
133 Feature to access: translation
134 Full text output file [nc_000913.translation]: stdout
135 >recA
136 MAIDENKQKALAAALGQIEKQFGKGSIMRLGEDRSMDVETISTGSLSLDIALGAGGLPMGR
137 IVEIYGPESSGKTTLTLQVIAAAQREGKTCAFIDAEHALDPIYARKLGVDIDNLLCSQPDT
138 GEQALEICDALARSGAVDVIVVDSVAALTPKAEIEGEIGDSHMGLAARMMSQAMRKLAGNL
139 KQSNTLLIFINQIRMKIGVMFGNPETTTGGNALKFYASVRLDIRRIGAVKEGENVVGSETR
140 VKVVKNKIAAPFKQAEFQILYGEGINFYGELVDLGVKEKLIEKAGAWYSYKGEKIGQGKAN
141 ATAWLKDNPETAKEIEKKVRELLLSNPNSTPDFSVDDSEGVAETNEDF
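
Sequence output such as the translation above is written in FASTA format, so a script can load it with a few lines of Python. The sketch below is one possible reader, not part of genret; parse_fasta is a hypothetical name, and the sample text is a shortened copy of the recA record for illustration only.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {n: "".join(parts) for n, parts in records.items()}

# Shortened sample: only the beginning of the recA translation.
sample = ">recA\nMAIDENKQKALAAALGQIEKQ\nFGKGSIMRLGEDRS\n"
print(parse_fasta(sample))
```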
142
143 Example 5
144
145 Retrieving feature information of the genes.
146 The following example will retrieve the start positions for each gene.
147 The values of the keys in GenBank format are available for retrieval
148 (e.g. start, end, direction, GO*, etc.).
149 Positions are returned as 1-based values.
150
151 % genret
152 Retrieves various gene related information from genome flatfile
153 Input nucleotide sequence(s): refseqn:NC_000913
154 Gene name(s) to lookup [*]:
155 Feature to access: start
156 Full text output file [nc_000913.start]:
157
158 Go to the input files for this example
159 Go to the output files for this example
160
161 Example 6
162
163 Passing extra arguments to the methods.
164 The following example shows the retrieval of 30 base pairs around the
165 start codon of the "recA" gene. By default, the "around_startcodon" method
166 returns 200 base pairs around the start codon. Using the "-argument"
167 qualifier allows the user to change this value.
168
169 % genret refseqn:NC_000913 recA around_startcodon -argument 30,30 stdout
170 Retrieves various gene features from genome flatfile
171 >recA
172 ccggtattacccggcatgacaggagtaaaaatggctatcgacgaaaacaaacagaaagcgt
173 tg
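
The output above can be checked by hand: reading "30,30" as 30 bases before and 30 bases after the start codon (an interpretation inferred from this output, not a documented guarantee), the sequence should contain 30 + 3 + 30 = 63 bases with the start codon at offset 30. A short Python check on the two output lines:

```python
# The two sequence lines printed for recA in the example above.
lines = [
    "ccggtattacccggcatgacaggagtaaaaatggctatcgacgaaaacaaacagaaagcgt",
    "tg",
]
sequence = "".join(lines)

# Assumed meaning of "-argument 30,30": bases before and after the start codon.
upstream, downstream = 30, 30
codon = sequence[upstream:upstream + 3]

print(len(sequence), codon)  # prints: 63 atg
```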
174
175 Example 7
176
177 Re-annotating a flatfile.
178 genret supports re-annotation of a genome flatfile via the Restauro-G
179 service developed by our team. Restauro-G uses the BLAST-Like Alignment
180 Tool (BLAT) to search UniProtKB and annotates information including the description,
181 comments, feature tables, cross references, COG family, position, and Pfam.
182 The original software is available at [http://restauro-g.iab.keio.ac.jp].
183
184
185 % genret refseqn:NC_000913 '*' annotate nc_000913-annotate.gbk
186 Retrieves various gene features from genome flatfile
187
188 Command line arguments
189
190 Standard (Mandatory) qualifiers:
191 [-sequence] seqall Nucleotide sequence(s) filename and optional
192 format, or reference (input USA)
193 [-gene] string [*] Gene name(s) to lookup (Any string)
194 [-access] string Feature to access (Any string)
195 [-outfile] outfile [*.genret] Full text output file
196
197 Additional (Optional) qualifiers: (none)
198 Advanced (Unprompted) qualifiers:
199 -argument string Option to give to method (Any string)
200 -[no]accid boolean [Y] Include to use sequence accession ID as
201 query
202
203 General qualifiers:
204 -help boolean Report command line options and exit. More
205 information on associated and general
206 qualifiers can be found with -help -verbose
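
For scripted use, the qualifiers listed above can be assembled into an argument list. The Python sketch below only builds the list and does not run anything; build_genret_command is a hypothetical helper, and it uses no qualifier beyond those shown in this section.

```python
def build_genret_command(sequence, access, gene="*", outfile=None,
                         argument=None, accid=True):
    """Build a genret argument list from the qualifiers documented above."""
    cmd = ["genret", "-sequence", sequence, "-gene", gene, "-access", access]
    if outfile is not None:
        cmd += ["-outfile", outfile]
    if argument is not None:
        # Extra option passed through to the method, e.g. "30,30".
        cmd += ["-argument", argument]
    if not accid:
        # Negated form of the boolean -[no]accid qualifier.
        cmd.append("-noaccid")
    return cmd

print(build_genret_command("refseqn:NC_000913", "around_startcodon",
                           gene="recA", outfile="stdout", argument="30,30"))
```

The resulting list can be passed to subprocess.run on a system where genret is installed. Keeping the arguments as a list avoids shell quoting problems with values such as '*'.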
207
208 Input file format
209
210 Database definitions for the examples are included in the embossrc_template
211 file of the Keio Bioinformatics Web Service (KBWS) package.
212
213 Input files for usage example 3
214
215 File: gene_list.txt
216
217 thrA
218 thrB
219 thrC
220
221 Output file format
222
223 Output files for usage example 1
224
225 File: nc_000913.around_startcodon
226
227 >thrL
228 cgtgagtaaattaaaattttattgacttaggtcactaaatactttaaccaatataggcata
229 gcgcacagacagataaaaattacagagtacacaacatccatgaaacgcattagcaccacca
230 ttaccaccaccatcaccattaccacaggtaacggtgcgggctgacgcgtacaggaaacaca
231 gaaaaaagcccgcacctgac
232 >thrA
233 aggtaacggtgcgggctgacgcgtacaggaaacacagaaaaaagcccgcacctgacagtgc
234 gggctttttttttcgaccaaaggtaacgaggtaacaaccatgcgagtgttgaagttcggcg
235 gtacatcagtggcaaatgcagaacgttttctgcgtgttgccgatattctggaaagcaatgc
236 caggcaggggcaggtggcca
237
238 [Part of this file has been deleted for brevity]
239
240 >yjjY
241 tgcatgtttgctacctaaattgccaactaaatcgaaacaggaagtacaaaagtccctgacc
242 tgcctgatgcatgctgcaaattaacatgatcggcgtaacatgactaaagtacgtaattgcg
243 ttcttgatgcactttccatcaacgtcaacaacatcattagcttggtcgtgggtactttccc
244 tcaggacccgacagtgtcaa
245 >yjtD
246 tttttctgcgacttacgttaagaatttgtaaattcgcaccgcgtaataagttgacagtgat
247 cacccggttcgcggttatttgatcaagaagagtggcaatatgcgtataacgattattctgg
248 tcgcacccgccagagcagaaaatattggggcagcggcgcgggcaatgaaaacgatggggtt
249 tagcgatctgcggattgtcg
250
251 Output files for usage example 5
252
253 File: nc_000913.start
254
255 gene,start
256 thrL,190
257 thrA,337
258 thrB,2801
259 thrC,3734
260 yaaX,5234
261 yaaA,5683
262 yaaJ,6529
263 talB,8238
264 mog,9306
265
266 [Part of this file has been deleted for brevity]
267
268 yjjX,4631256
269 ytjC,4631820
270 rob,4632464
271 creA,4633544
272 creB,4634030
273 creC,4634719
274 creD,4636201
275 arcA,4637613
276 yjjY,4638425
277 yjtD,4638965
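
Feature tables such as the one above are comma-separated with a header row, so they can be read with Python's csv module. The sketch below is an illustration only; parse_feature_table is a hypothetical name and the sample is the first few rows of the file shown above. Because positions are 1-based, one is subtracted to obtain a 0-based index for slicing in Python.

```python
import csv
import io

def parse_feature_table(text):
    """Parse 'gene,<feature>' CSV text into a {gene: value} dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    feature = reader.fieldnames[1]  # second column: "start", "direction", ...
    return {row["gene"]: row[feature] for row in reader}

sample = "gene,start\nthrL,190\nthrA,337\nthrB,2801\n"
starts = {gene: int(pos) for gene, pos in parse_feature_table(sample).items()}

# Convert 1-based positions to 0-based indices for Python slicing.
zero_based = {gene: pos - 1 for gene, pos in starts.items()}
print(starts["thrB"], zero_based["thrB"])  # prints: 2801 2800
```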
278
279 Output files for usage example 7
280
281 File: ecoli-annotate.gbk
282
283 LOCUS NC_000913 4639675 bp DNA circular BCT 25-OCT-2010
284 DEFINITION Escherichia coli str. K-12 substr. MG1655 chromosome, complete
285 genome.
286 ACCESSION NC_000913
287 VERSION NC_000913.2 GI:49175990
288 DBLINK Project: 57779
289 KEYWORDS .
290 SOURCE Escherichia coli str. K-12 substr. MG1655
291 ORGANISM Escherichia coli str. K-12 substr. MG1655
292 Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
293
294 [Part of this file has been deleted for brevity]
295
296 CDS 2801..3733
297 /EC_number="2.7.1.39"
298 /codon_start="1"
299 /db_xref="GI:16127997"
300 /db_xref="ASAP:ABE-0000010"
301 /db_xref="UniProtKB/Swiss-Prot:P00547"
302 /db_xref="ECOCYC:EG10999"
303 /db_xref="EcoGene:EG10999"
304 /db_xref="GeneID:947498"
305 /function="enzyme; Amino acid biosynthesis: Threonine"
306 /function="1.5.1.8 metabolism; building block
307 biosynthesis; amino acids; threonine"
308 /function="7.1 location of gene products; cytoplasm"
309 /gene="thrB"
310 /gene_synonym="ECK0003; JW0002"
311 /locus_tag="b0003"
312 /note="GO_component: GO:0005737 - cytoplasm; GO_process:
313 GO:0009088 - threonine biosynthetic process"
314 /product="homoserine kinase"
315 /protein_id="NP_414544.1"
316 /rs_com="FUNCTION: Catalyzes the ATP-dependent
317 phosphorylation of L- homoserine to L-homoserine
318 phosphate (By similarity)."
319 /rs_com="CATALYTIC ACTIVITY: ATP + L-homoserine = ADP +
320 O-phospho-L- homoserine."
321 /rs_com="PATHWAY: Amino-acid biosynthesis; L-threonine
322 biosynthesis; L- threonine from L-aspartate: step 4/5."
323 /rs_com="SUBCELLULAR LOCATION: Cytoplasm (Potential)."
324 /rs_com="SIMILARITY: Belongs to the GHMP kinase family.
325 Homoserine kinase subfamily."
326 /rs_des="RecName: Full=Homoserine kinase; Short=HK;
327 Short=HSK; EC=2.7.1.39;"
328 /rs_protein="Level 1: similar to KHSE_ECODH 1.7e-180"
329 /rs_xr="EMBL; CP000948; ACB01208.1; -; Genomic_DNA."
330 /rs_xr="RefSeq; YP_001728986.1; -."
331 /rs_xr="ProteinModelPortal; B1XBC8; -."
332 /rs_xr="SMR; B1XBC8; 2-308."
333 /rs_xr="EnsemblBacteria; EBESCT00000012034;
334 EBESCP00000011562; EBESCG00000011096."
335 /rs_xr="GeneID; 6058639; -."
336 /rs_xr="GenomeReviews; CP000948_GR; ECDH10B_0003."
337 /rs_xr="KEGG; ecd:ECDH10B_0003; -."
338 /rs_xr="HOGENOM; HBG646290; -."
339 /rs_xr="OMA; GSAHADN; -."
340 /rs_xr="ProtClustDB; PRK01212; -."
341 /rs_xr="BioCyc; ECOL316385:ECDH10B_0003-MONOMER; -."
342 /rs_xr="GO; GO:0005737; C:cytoplasm;
343 IEA:UniProtKB-SubCell."
344 /rs_xr="GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW."
345 /rs_xr="GO; GO:0004413; F:homoserine kinase activity;
346 IEA:EC."
347 /rs_xr="GO; GO:0009088; P:threonine biosynthetic process;
348 IEA:UniProtKB-KW."
349 /rs_xr="HAMAP; MF_00384; Homoser_kinase; 1; -."
350 /rs_xr="InterPro; IPR006204; GHMP_kinase."
351 /rs_xr="InterPro; IPR013750; GHMP_kinase_C."
352 /rs_xr="InterPro; IPR006203; GHMP_knse_ATP-bd_CS."
353 /rs_xr="InterPro; IPR000870; Homoserine_kin."
354 /rs_xr="InterPro; IPR020568; Ribosomal_S5_D2-typ_fold."
355 /rs_xr="InterPro; IPR014721;
356 Ribosomal_S5_D2-typ_fold_subgr."
357 /rs_xr="Gene3D; G3DSA:3.30.230.10;
358 Ribosomal_S5_D2-type_fold; 1."
359 /rs_xr="Pfam; PF08544; GHMP_kinases_C; 1."
360 /rs_xr="Pfam; PF00288; GHMP_kinases_N; 1."
361 /rs_xr="PIRSF; PIRSF000676; Homoser_kin; 1."
362 /rs_xr="PRINTS; PR00958; HOMSERKINASE."
363 /rs_xr="SUPFAM; SSF54211; Ribosomal_S5_D2-typ_fold; 1."
364 /rs_xr="TIGRFAMs; TIGR00191; thrB; 1."
365 /rs_xr="PROSITE; PS00627; GHMP_KINASES_ATP; 1."
366 /transl_table="11"
367 /translation="MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETF
368 SLNNLGRFADKLPSEPRENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSACS
369 VVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHYDNVAPCFLGGMQLMIEENDI
370 ISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGFIHACYSRQ
371 PELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETA
372 QRVADWLGKNYLQNQEGFVHICRLDTAGARVLEN"
373
374 [Part of this file has been deleted for brevity]
375
376 4639201 gcgcagtcgg gcgaaatatc attactacgc cacgccagtt gaactggtgc cgctgttaga
377 4639261 ggaaaaatct tcatggatga gccatgccgc gctggtgttt ggtcgcgaag attccgggtt
378 4639321 gactaacgaa gagttagcgt tggctgacgt tcttactggt gtgccgatgg tggcggatta
379 4639381 tccttcgctc aatctggggc aggcggtgat ggtctattgc tatcaattag caacattaat
380 4639441 acaacaaccg gcgaaaagtg atgcaacggc agaccaacat caactgcaag ctttacgcga
381 4639501 acgagccatg acattgctga cgactctggc agtggcagat gacataaaac tggtcgactg
382 4639561 gttacaacaa cgcctggggc ttttagagca acgagacacg gcaatgttgc accgtttgct
383 4639621 gcatgatatt gaaaaaaata tcaccaaata aaaaacgcct tagtaagtat ttttc
384 //
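
The qualifiers added by re-annotation (rs_com, rs_des, rs_protein, rs_xr) are ordinary GenBank feature qualifiers, so simple cases can be handled with a regular expression. The Python sketch below pulls Pfam accessions out of rs_xr lines; pfam_ids is a hypothetical helper, and the pattern assumes the 'Pfam; PFnnnnn;' layout seen in the sample above.

```python
import re

def pfam_ids(genbank_text):
    """Return Pfam accessions found in /rs_xr qualifiers of GenBank text."""
    # \s* tolerates a line wrap between "Pfam;" and the accession.
    return re.findall(r'/rs_xr="Pfam;\s*(PF\d+);', genbank_text)

sample = '''
/rs_xr="Gene3D; G3DSA:3.30.230.10;
Ribosomal_S5_D2-type_fold; 1."
/rs_xr="Pfam; PF08544; GHMP_kinases_C; 1."
/rs_xr="Pfam; PF00288; GHMP_kinases_N; 1."
'''
print(pfam_ids(sample))
```

For anything beyond quick extraction, a full GenBank parser is a safer choice than regular expressions.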
385
386 Data files
387
388 None.
389
390 Notes
391
392 None.
393
394 References
395
396 Arakawa, K., Mori, K., Ikeda, K., Matsuzaki, T., Kobayashi, Y., and
397 Tomita, M. (2003) G-language Genome Analysis Environment: A Workbench
398 for Nucleotide Sequence Data Mining, Bioinformatics, 19, 305-306.
399
400 Arakawa, K. and Tomita, M. (2006) G-language System as a Platform for
401 large-scale analysis of high-throughput omics data, J. Pest Sci.,
402 31, 7.
403
404 Arakawa, K., Kido, N., Oshita, K., and Tomita, M. (2010) G-language Genome
405 Analysis Environment with REST and SOAP Web Service Interfaces,
406 Nucleic Acids Res., 38, W700-W705.
407
408 Warnings
409
410 None.
411
412 Diagnostic Error Messages
413
414 None.
415
416 Exit status
417
418 It always exits with a status of 0.
419
420 Known bugs
421
422 None.
423
424 See also
425
426 entret Retrieve sequence entries from flatfile databases and files
427 seqret Read and write (return) sequences
428
429 Author(s)
430
431 Hidetoshi Itaya (celery@g-language.org)
432 Institute for Advanced Biosciences, Keio University
433 252-0882 Japan
434
435 Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
436 Institute for Advanced Biosciences, Keio University
437 252-0882 Japan
438
439 History
440
441 2012 - Written by Hidetoshi Itaya
442
443 Target users
444
445 This program is intended to be used by everyone and everything, from
446 naive users to embedded scripts.
447
448 Comments
449
450 None.
451