view GEMBASSY-1.0.3/doc/text/genret.txt @ 1:84a17b3fad1f draft

Uploaded
author ktnyt
date Fri, 26 Jun 2015 05:20:29 -0400
parents 8300eb051bea

                                     genret
Function

   Retrieves various gene related information from genome flatfile

Description

   genret reads in one or more genome flatfiles and retrieves various data from
   the input file. It is a wrapper program for the G-language REST service,
   where a method is specified by giving a string to the "access" qualifier. By
   default, genret will parse the input file to retrieve the accession ID
   (or name) of the genome and use it to query the G-language REST service. By
   setting the "accid" qualifier to false (or 0), genret will instead parse the
   sequence and features of the genome to create a GenBank formatted flatfile
   and upload the file to the G-language web server. genret will then execute
   the given method on the uploaded file.

   genret is able to perform a variety of tasks, including the retrieval of
   sequence upstream, downstream, or around the start or stop codon,
   translated gene sequences, search of gene data by keyword, and re-annotation
   and retrieval of genome flatfiles. The set of genes can be given as flat
   text, a regular expression, or a file containing the list of genes.

   Details on the G-language REST service are available from the wiki page

   http://www.g-language.org/wiki/rest

   Documentation on G-language Genome Analysis Environment methods is
   provided at the Document Center

   http://ws.g-language.org/gdoc/

Usage

   Here is a sample session with genret

   Retrieving sequences upstream, downstream, or around the start/stop codons. 
   The following example shows the retrieval of sequence around the start
   codons of all genes.

   Genes to access are specified by regular expression. '*' stands for every
   gene.

   Available methods are:
      after_startcodon
      after_stopcodon
      around_startcodon
      around_stopcodon
      before_startcodon
      before_stopcodon

% genret
Retrieves various gene related information from genome flatfile
Input nucleotide sequence(s): refseqn:NC_000913
Gene name(s) to lookup [*]:
Feature to access: around_startcodon
Full text output file [nc_000913.around_startcodon]:

   Go to the input files for this example
   Go to the output files for this example

   Example 2

   Using flat text as target genes. The names can be separated by a space,
   comma, or vertical bar.

% genret
Retrieves various gene related information from genome flatfile
Input nucleotide sequence(s): refseqn:NC_000913
List of gene name(s) to report [*]: recA,recB
Name of gene feature to access: translation
Sequence output file [nc_000913.translation.genret]: stdout
>recA
MAIDENKQKALAAALGQIEKQFGKGSIMRLGEDRSMDVETISTGSLSLDIALGAGGLPMGR
IVEIYGPESSGKTTLTLQVIAAAQREGKTCAFIDAEHALDPIYARKLGVDIDNLLCSQPDT
GEQALEICDALARSGAVDVIVVDSVAALTPKAEIEGEIGDSHMGLAARMMSQAMRKLAGNL
KQSNTLLIFINQIRMKIGVMFGNPETTTGGNALKFYASVRLDIRRIGAVKEGENVVGSETR
VKVVKNKIAAPFKQAEFQILYGEGINFYGELVDLGVKEKLIEKAGAWYSYKGEKIGQGKAN
ATAWLKDNPETAKEIEKKVRELLLSNPNSTPDFSVDDSEGVAETNEDF
>recB
MSDVAETLDPLRLPLQGERLIEASAGTGKTFTIAALYLRLLLGLGGSAAFPRPLTVEELLV
VTFTEAATAELRGRIRSNIHELRIACLRETTDNPLYERLLEEIDDKAQAAQWLLLAERQMD
EAAVFTIHGFCQRMLNLNAFESGMLFEQQLIEDESLLRYQACADFWRRHCYPLPREIAQVV
FETWKGPQALLRDINRYLQGEAPVIKAPPPDDETLASRHAQIVARIDTVKQQWRDAVGELD
ALIESSGIDRRKFNRSNQAKWIDKISAWAEEETNSYQLPESLEKFSQRFLEDRTKAGGETP
RHPLFEAIDQLLAEPLSIRDLVITRALAEIRETVAREKRRRGELGFDDMLSRLDSALRSES
GEVLAAAIRTRFPVAMIDEFQDTDPQQYRIFRRIWHHQPETALLLIGDPKQAIYAFRGADI
FTYMKARSEVHAHYTLDTNWRSAPGMVNSVNKLFSQTDDAFMFREIPFIPVKSAGKNQALR
FVFKGETQPAMKMWLMEGESCGVGDYQSTMAQVCAAQIRDWLQAGQRGEALLMNGDDARPV
RASDISVLVRSRQEAAQVRDALTLLEIPSVYLSNRDSVFETLEAQEMLWLLQAVMTPEREN
TLRSALATSMMGLNALDIETLNNDEHAWDVVVEEFDGYRQIWRKRGVMPMLRALMSARNIA
ENLLATAGGERRLTDILHISELLQEAGTQLESEHALVRWLSQHILEPDSNASSQQMRLESD
KHLVQIVTIHKSKGLEYPLVWLPFITNFRVQEQAFYHDRHSFEAVLDLNAAPESVDLAEAE
RLAEDLRLLYVALTRSVWHCSLGVAPLVRRRGDKKGDTDVHQSALGRLLQKGEPQDAAGLR
TCIEALCDDDIAWQTAQTGDNQPWQVNDVSTAELNAKTLQRLPGDNWRVTSYSGLQQRGHG
IAQDLMPRLDVDAAGVASVVEEPTLTPHQFPRGASPGTFLHSLFEDLDFTQPVDPNWVREK
LELGGFESQWEPVLTEWITAVLQAPLNETGVSLSQLSARNKQVEMEFYLPISEPLIASQLD
TLIRQFDPLSAGCPPLEFMQVRGMLKGFIDLVFRHEGRYYLLDYKSNWLGEDSSAYTQQAM
AAAMQAHRYDLQYQLYTLALHRYLRHRIADYDYEHHFGGVIYLFLRGVDKEHPQQGIYTTR
PNAGLIALMDEMFAGMTLEEA
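
   The delimiter rule described for this example (space, comma, or vertical
   bar) can be mirrored in a script that prepares gene lists before calling
   genret. The following is a minimal Python sketch; the function name
   split_gene_names is hypothetical, and the code only illustrates the rule
   as stated above, not how genret itself parses the string.

```python
import re


def split_gene_names(text):
    """Split a gene-name string on spaces, commas, or vertical bars.

    Hypothetical helper: illustrates the documented delimiters only.
    """
    # Runs of delimiters (for example ", ") are treated as one separator,
    # and empty pieces are dropped.
    return [name for name in re.split(r"[ ,|]+", text.strip()) if name]


print(split_gene_names("recA,recB"))       # ['recA', 'recB']
print(split_gene_names("recA recB|recC"))  # ['recA', 'recB', 'recC']
```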

   Example 3

   Using a file with a list of gene names.
   The following example will retrieve the strand direction for each gene
   listed in the "gene_list.txt" file. Strings prefixed with "@" or "list::"
   will be interpreted as file names.

% genret
Retrieves various gene features from genome flatfile
Input nucleotide sequence(s): refseqn:NC_000913
List of gene name(s) to report [*]: @gene_list.txt
Name of gene feature to access: direction
Full text output file [nc_000913.direction]: stdout
gene,direction
thrA,direct
thrB,direct
thrC,direct

   Go to the input files for this example
   Go to the output files for this example
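
   A gene list file such as the one used in this example can be generated
   from a script. The sketch below writes one gene name per line, matching
   the layout of gene_list.txt, and builds the "@"-prefixed value. The
   helper names are hypothetical, and the temporary directory is used only
   to keep the example self-contained.

```python
import os
import tempfile


def write_gene_list(path, genes):
    """Write one gene name per line, as in gene_list.txt."""
    with open(path, "w") as handle:
        for gene in genes:
            handle.write(gene + "\n")


def as_list_reference(path):
    """Return the '@'-prefixed form that marks a value as a file name."""
    return "@" + path


def is_list_reference(value):
    """True if the value carries either file prefix named in the text."""
    return value.startswith("@") or value.startswith("list::")


tmpdir = tempfile.mkdtemp()
list_path = os.path.join(tmpdir, "gene_list.txt")
write_gene_list(list_path, ["thrA", "thrB", "thrC"])
print(as_list_reference("gene_list.txt"))  # @gene_list.txt
```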

   Example 4

   Retrieving translations of coding sequences.
   The following example will retrieve the translated protein sequence of
   the "recA" gene.

% genret
Retrieves various gene related information from genome flatfile
Input nucleotide sequence(s): refseqn:NC_000913
Gene name(s) to lookup [*]: recA
Feature to access: translation
Full text output file [nc_000913.translation]: stdout
>recA
MAIDENKQKALAAALGQIEKQFGKGSIMRLGEDRSMDVETISTGSLSLDIALGAGGLPMGR
IVEIYGPESSGKTTLTLQVIAAAQREGKTCAFIDAEHALDPIYARKLGVDIDNLLCSQPDT
GEQALEICDALARSGAVDVIVVDSVAALTPKAEIEGEIGDSHMGLAARMMSQAMRKLAGNL
KQSNTLLIFINQIRMKIGVMFGNPETTTGGNALKFYASVRLDIRRIGAVKEGENVVGSETR
VKVVKNKIAAPFKQAEFQILYGEGINFYGELVDLGVKEKLIEKAGAWYSYKGEKIGQGKAN
ATAWLKDNPETAKEIEKKVRELLLSNPNSTPDFSVDDSEGVAETNEDF
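
   The output above is FASTA-style text: a ">" header line with the gene
   name, followed by wrapped sequence lines. A downstream script might read
   it back as sketched below. The function name parse_fasta is hypothetical,
   and the sample holds only the first two sequence lines of the recA output
   shown above.

```python
def parse_fasta(text):
    """Return a dict mapping each header name to its joined sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the name is everything after '>'.
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# First two sequence lines of the recA translation from the example.
sample = """>recA
MAIDENKQKALAAALGQIEKQFGKGSIMRLGEDRSMDVETISTGSLSLDIALGAGGLPMGR
IVEIYGPESSGKTTLTLQVIAAAQREGKTCAFIDAEHALDPIYARKLGVDIDNLLCSQPDT
"""

proteins = parse_fasta(sample)
print(sorted(proteins))
```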

   Example 5

   Retrieving feature information of the genes.
   The following example will retrieve the start positions for each gene.
   The values for the keys in GenBank format are available for retrieval
   (e.g. start, end, direction, GO*).
   Positions are returned using 1-based coordinates.

% genret
Retrieves various gene related information from genome flatfile
Input nucleotide sequence(s): refseqn:NC_000913
Gene name(s) to lookup [*]:
Feature to access: start
Full text output file [nc_000913.start]:

   Go to the input files for this example
   Go to the output files for this example
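
   The comma-separated output of this example (a "gene,start" header followed
   by one row per gene) can be loaded with a standard CSV reader. The sketch
   below also converts the 1-based positions to 0-based offsets, which is what
   Python string slicing expects. The function name read_positions is
   hypothetical; the sample rows are taken from the output file listing for
   this example.

```python
import csv
import io


def read_positions(text):
    """Parse 'gene,start' rows into a dict of 1-based integer positions."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["gene"]: int(row["start"]) for row in reader}


sample = "gene,start\nthrL,190\nthrA,337\nthrB,2801\n"
starts = read_positions(sample)

# genret reports 1-based positions; subtract one for 0-based indexing.
zero_based = {gene: pos - 1 for gene, pos in starts.items()}
print(starts["thrA"], zero_based["thrA"])  # 337 336
```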

   Example 6

   Passing extra arguments to the methods.
   The following example shows the retrieval of 30 base pairs around the
   start codon of the "recA" gene. By default, the "around_startcodon" method
   returns 200 base pairs around the start codon. Using the "-argument"
   qualifier allows the user to change this value.

% genret refseqn:NC_000913 recA around_startcodon -argument 30,30 stdout
Retrieves various gene features from genome flatfile
>recA
ccggtattacccggcatgacaggagtaaaaatggctatcgacgaaaacaaacagaaagcgt
tg
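
   The output above is 63 characters long, which is consistent with 30 bases
   upstream, the three-base start codon, and 30 bases downstream. This
   reading is inferred from the example output rather than from the genret
   source. The Python sketch below shows the same windowing on a plain
   string; the function name around_start is hypothetical, and it handles
   only genes on the direct strand of a linear sequence.

```python
def around_start(sequence, start, upstream, downstream):
    """Return the window around a start codon.

    sequence   -- nucleotide string
    start      -- 1-based position of the first base of the start codon
    upstream   -- number of bases to include before the codon
    downstream -- number of bases to include after the codon
    """
    codon_begin = start - 1                   # convert to 0-based offset
    left = max(0, codon_begin - upstream)     # clamp at the sequence start
    right = codon_begin + 3 + downstream      # codon is three bases long
    return sequence[left:right]


# recA output from the example, joined into one string.
reca_window = ("ccggtattacccggcatgacaggagtaaaaatggctatcgacgaaaacaaacagaaagcgt"
               "tg")
print(len(reca_window))    # 63
print(reca_window[30:33])  # atg

# A narrower window cut from the same string, with the codon at position 31.
print(around_start(reca_window, 31, 5, 5))
```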

   Example 7

   Re-annotating a flatfile.
   genret supports re-annotation of a genome flatfile via the Restauro-G
   service developed by our team. Restauro-G uses the BLAST-Like Alignment
   Tool (BLAT) to search UniProtKB and annotates information including the
   description, comments, feature tables, cross references, COG family,
   position, and Pfam.
   The original software is available at [http://restauro-g.iab.keio.ac.jp].

% genret refseqn:NC_000913 '*' annotate nc_000913-annotate.gbk
Retrieves various gene features from genome flatfile

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
  [-gene]              string     [*] Gene name(s) to lookup (Any string)
  [-access]            string     Feature to access (Any string)
  [-outfile]           outfile    [*.genret] Full text output file

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -argument           string     Option to give to method (Any string)
   -[no]accid          boolean    [Y] Include to use sequence accession ID as
                                  query

   General qualifiers:
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
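
   For scripted use, the qualifiers listed above can be assembled into an
   argument list. The following Python sketch is one possible way to do so.
   The function name build_genret_command is hypothetical, the qualifier
   names are taken from the table above, and "-noaccid" follows the
   "-[no]accid" notation shown there. Running the command requires genret to
   be installed, so the sketch only builds and prints the list.

```python
def build_genret_command(sequence, access, gene="*", outfile="stdout",
                         argument=None, accid=True):
    """Build a genret argument list from the documented qualifiers."""
    command = ["genret",
               "-sequence", sequence,
               "-gene", gene,
               "-access", access,
               "-outfile", outfile]
    if argument is not None:
        # Extra option passed through to the selected method.
        command += ["-argument", argument]
    if not accid:
        # Upload the parsed flatfile instead of querying by accession ID.
        command.append("-noaccid")
    return command


command = build_genret_command("refseqn:NC_000913", "around_startcodon",
                               gene="recA", argument="30,30")
print(" ".join(command))
# To execute (assumes genret is on the PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```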

Input file format

   Database definitions for the examples are included in the embossrc_template
   file of the Keio Bioinformatics Web Service (KBWS) package.

   Input files for usage example 3

   File: gene_list.txt

thrA
thrB
thrC

Output file format

   Output files for usage example 1

   File: nc_000913.around_startcodon

>thrL
cgtgagtaaattaaaattttattgacttaggtcactaaatactttaaccaatataggcata
gcgcacagacagataaaaattacagagtacacaacatccatgaaacgcattagcaccacca
ttaccaccaccatcaccattaccacaggtaacggtgcgggctgacgcgtacaggaaacaca
gaaaaaagcccgcacctgac
>thrA
aggtaacggtgcgggctgacgcgtacaggaaacacagaaaaaagcccgcacctgacagtgc
gggctttttttttcgaccaaaggtaacgaggtaacaaccatgcgagtgttgaagttcggcg
gtacatcagtggcaaatgcagaacgttttctgcgtgttgccgatattctggaaagcaatgc
caggcaggggcaggtggcca

   [Part of this file has been deleted for brevity]

>yjjY
tgcatgtttgctacctaaattgccaactaaatcgaaacaggaagtacaaaagtccctgacc
tgcctgatgcatgctgcaaattaacatgatcggcgtaacatgactaaagtacgtaattgcg
ttcttgatgcactttccatcaacgtcaacaacatcattagcttggtcgtgggtactttccc
tcaggacccgacagtgtcaa
>yjtD
tttttctgcgacttacgttaagaatttgtaaattcgcaccgcgtaataagttgacagtgat
cacccggttcgcggttatttgatcaagaagagtggcaatatgcgtataacgattattctgg
tcgcacccgccagagcagaaaatattggggcagcggcgcgggcaatgaaaacgatggggtt
tagcgatctgcggattgtcg

   Output files for usage example 5

   File: nc_000913.start

gene,start
thrL,190
thrA,337
thrB,2801
thrC,3734
yaaX,5234
yaaA,5683
yaaJ,6529
talB,8238
mog,9306

   [Part of this file has been deleted for brevity]

yjjX,4631256
ytjC,4631820
rob,4632464
creA,4633544
creB,4634030
creC,4634719
creD,4636201
arcA,4637613
yjjY,4638425
yjtD,4638965

   Output files for usage example 7

   File: nc_000913-annotate.gbk

LOCUS       NC_000913            4639675 bp    DNA     circular BCT 25-OCT-2010
DEFINITION  Escherichia coli str. K-12 substr. MG1655 chromosome, complete
            genome.
ACCESSION   NC_000913
VERSION     NC_000913.2  GI:49175990
DBLINK      Project: 57779
KEYWORDS    .
SOURCE      Escherichia coli str. K-12 substr. MG1655
  ORGANISM  Escherichia coli str. K-12 substr. MG1655
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;

   [Part of this file has been deleted for brevity]

     CDS             2801..3733
                     /EC_number="2.7.1.39"
                     /codon_start="1"
                     /db_xref="GI:16127997"
                     /db_xref="ASAP:ABE-0000010"
                     /db_xref="UniProtKB/Swiss-Prot:P00547"
                     /db_xref="ECOCYC:EG10999"
                     /db_xref="EcoGene:EG10999"
                     /db_xref="GeneID:947498"
                     /function="enzyme; Amino acid biosynthesis: Threonine"
                     /function="1.5.1.8 metabolism; building block
                     biosynthesis; amino acids; threonine"
                     /function="7.1 location of gene products; cytoplasm"
                     /gene="thrB"
                     /gene_synonym="ECK0003; JW0002"
                     /locus_tag="b0003"
                     /note="GO_component: GO:0005737 - cytoplasm; GO_process:
                     GO:0009088 - threonine biosynthetic process"
                     /product="homoserine kinase"
                     /protein_id="NP_414544.1"
                     /rs_com="FUNCTION: Catalyzes the ATP-dependent
                     phosphorylation of L- homoserine to L-homoserine
                     phosphate (By similarity)."
                     /rs_com="CATALYTIC ACTIVITY: ATP + L-homoserine = ADP +
                     O-phospho-L- homoserine."
                     /rs_com="PATHWAY: Amino-acid biosynthesis; L-threonine
                     biosynthesis; L- threonine from L-aspartate: step 4/5."
                     /rs_com="SUBCELLULAR LOCATION: Cytoplasm (Potential)."
                     /rs_com="SIMILARITY: Belongs to the GHMP kinase family.
                     Homoserine kinase subfamily."
                     /rs_des="RecName: Full=Homoserine kinase; Short=HK;
                     Short=HSK; EC=2.7.1.39;"
                     /rs_protein="Level 1: similar to KHSE_ECODH 1.7e-180"
                     /rs_xr="EMBL; CP000948; ACB01208.1; -; Genomic_DNA."
                     /rs_xr="RefSeq; YP_001728986.1; -."
                     /rs_xr="ProteinModelPortal; B1XBC8; -."
                     /rs_xr="SMR; B1XBC8; 2-308."
                     /rs_xr="EnsemblBacteria; EBESCT00000012034;
                     EBESCP00000011562; EBESCG00000011096."
                     /rs_xr="GeneID; 6058639; -."
                     /rs_xr="GenomeReviews; CP000948_GR; ECDH10B_0003."
                     /rs_xr="KEGG; ecd:ECDH10B_0003; -."
                     /rs_xr="HOGENOM; HBG646290; -."
                     /rs_xr="OMA; GSAHADN; -."
                     /rs_xr="ProtClustDB; PRK01212; -."
                     /rs_xr="BioCyc; ECOL316385:ECDH10B_0003-MONOMER; -."
                     /rs_xr="GO; GO:0005737; C:cytoplasm;
                     IEA:UniProtKB-SubCell."
                     /rs_xr="GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW."
                     /rs_xr="GO; GO:0004413; F:homoserine kinase activity;
                     IEA:EC."
                     /rs_xr="GO; GO:0009088; P:threonine biosynthetic process;
                     IEA:UniProtKB-KW."
                     /rs_xr="HAMAP; MF_00384; Homoser_kinase; 1; -."
                     /rs_xr="InterPro; IPR006204; GHMP_kinase."
                     /rs_xr="InterPro; IPR013750; GHMP_kinase_C."
                     /rs_xr="InterPro; IPR006203; GHMP_knse_ATP-bd_CS."
                     /rs_xr="InterPro; IPR000870; Homoserine_kin."
                     /rs_xr="InterPro; IPR020568; Ribosomal_S5_D2-typ_fold."
                     /rs_xr="InterPro; IPR014721;
                     Ribosomal_S5_D2-typ_fold_subgr."
                     /rs_xr="Gene3D; G3DSA:3.30.230.10;
                     Ribosomal_S5_D2-type_fold; 1."
                     /rs_xr="Pfam; PF08544; GHMP_kinases_C; 1."
                     /rs_xr="Pfam; PF00288; GHMP_kinases_N; 1."
                     /rs_xr="PIRSF; PIRSF000676; Homoser_kin; 1."
                     /rs_xr="PRINTS; PR00958; HOMSERKINASE."
                     /rs_xr="SUPFAM; SSF54211; Ribosomal_S5_D2-typ_fold; 1."
                     /rs_xr="TIGRFAMs; TIGR00191; thrB; 1."
                     /rs_xr="PROSITE; PS00627; GHMP_KINASES_ATP; 1."
                     /transl_table="11"
                     /translation="MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETF
                     SLNNLGRFADKLPSEPRENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSACS
                     VVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHYDNVAPCFLGGMQLMIEENDI
                     ISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGFIHACYSRQ
                     PELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETA
                     QRVADWLGKNYLQNQEGFVHICRLDTAGARVLEN"

   [Part of this file has been deleted for brevity]

  4639201 gcgcagtcgg gcgaaatatc attactacgc cacgccagtt gaactggtgc cgctgttaga
  4639261 ggaaaaatct tcatggatga gccatgccgc gctggtgttt ggtcgcgaag attccgggtt
  4639321 gactaacgaa gagttagcgt tggctgacgt tcttactggt gtgccgatgg tggcggatta
  4639381 tccttcgctc aatctggggc aggcggtgat ggtctattgc tatcaattag caacattaat
  4639441 acaacaaccg gcgaaaagtg atgcaacggc agaccaacat caactgcaag ctttacgcga
  4639501 acgagccatg acattgctga cgactctggc agtggcagat gacataaaac tggtcgactg
  4639561 gttacaacaa cgcctggggc ttttagagca acgagacacg gcaatgttgc accgtttgct
  4639621 gcatgatatt gaaaaaaata tcaccaaata aaaaacgcct tagtaagtat ttttc
//
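
   In the listing above, the re-annotation results appear as extra
   qualifiers on each feature (/rs_com, /rs_des, /rs_protein, /rs_xr). A
   script could pull them out as sketched below. This is a simplified,
   hypothetical helper rather than a full GenBank parser: it joins wrapped
   lines with a single space, which suits the text qualifiers but not
   /translation values. The sample lines are copied from the thrB feature.

```python
import re

QUALIFIER = re.compile(r"^/(\w+)=(.*)$")


def collect_qualifiers(feature_lines):
    """Group qualifier values by key, joining wrapped continuation lines."""
    qualifiers = {}
    key = None
    for raw in feature_lines:
        line = raw.strip()
        match = QUALIFIER.match(line)
        if match:
            key = match.group(1)
            qualifiers.setdefault(key, []).append(match.group(2))
        elif key is not None and line:
            # Continuation of the previous qualifier value.
            qualifiers[key][-1] += " " + line
    # Remove the surrounding double quotes from each value.
    return {k: [v.strip('"') for v in values]
            for k, values in qualifiers.items()}


sample = '''
                     /gene="thrB"
                     /product="homoserine kinase"
                     /rs_xr="Pfam; PF08544; GHMP_kinases_C; 1."
                     /rs_xr="Pfam; PF00288; GHMP_kinases_N; 1."
                     /rs_xr="GO; GO:0005737; C:cytoplasm;
                     IEA:UniProtKB-SubCell."
'''.splitlines()

quals = collect_qualifiers(sample)
pfam_ids = [value.split("; ")[1] for value in quals["rs_xr"]
            if value.startswith("Pfam;")]
print(pfam_ids)  # ['PF08544', 'PF00288']
```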

Data files

   None.

Notes

   None.

References

   Arakawa, K., Mori, K., Ikeda, K., Matsuzaki, T., Kobayashi, Y., and
      Tomita, M. (2003) G-language Genome Analysis Environment: A Workbench
      for Nucleotide Sequence Data Mining, Bioinformatics, 19, 305-306.

   Arakawa, K. and Tomita, M. (2006) G-language System as a Platform for
      large-scale analysis of high-throughput omics data, J. Pest Sci.,
      31, 7.

   Arakawa, K., Kido, N., Oshita, K., and Tomita, M. (2010) G-language Genome
      Analysis Environment with REST and SOAP Web Service Interfaces,
      Nucleic Acids Res., 38, W700-W705.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with a status of 0.

Known bugs

   None.

See also

   entret Retrieve sequence entries from flatfile databases and files
   seqret Read and write (return) sequences

Author(s)

   Hidetoshi Itaya (celery@g-language.org)
   Institute for Advanced Biosciences, Keio University
   252-0882 Japan

   Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
   Institute for Advanced Biosciences, Keio University
   252-0882 Japan

History

   2012 - Written by Hidetoshi Itaya

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.

Comments

   None.