# HG changeset patch
# User iuc
# Date 1686948761 0
# Node ID 5532c0e5d4a699c60c915736a90db1eace96b8df
# Parent 85c4115461235fae8e14fa0631b6fbe134d59998
planemo upload for repository https://gitlab.com/paulklemm_PHD/proteinortho commit b4d8b8da2a259973c9ad90e4b9d1a3e22ae4348f

diff -r 85c411546123 -r 5532c0e5d4a6 proteinortho.xml
--- a/proteinortho.xml Tue Nov 22 16:49:50 2022 +0000
+++ b/proteinortho.xml Fri Jun 16 20:52:41 2023 +0000
@@ -2,24 +2,37 @@
  detects orthologous proteins/genes within different species
  proteinortho_macros.xml
- + + + + + + + + + + + + + +
@@ -44,21 +57,23 @@
  proteinortho
  --project=result
  --cpus="\${GALAXY_SLOTS:-4}"
- --ram="\${GALAXY_MEMORY_MB:-16000}"
  #if $more_options.selfblast:
  $more_options.selfblast
  #end if
  #if $more_options.singles:
  $more_options.singles
  #end if
+ #if $more_options.core:
+ $more_options.core
+ #end if
  --p=$p
- --e=$evalue
+ --e=$more_options.evalue
  --conn=$conn
  #if $more_options.cov:
  --cov=$more_options.cov
  #end if
- #if $more_options.sim:
- --sim=`LC_NUMERIC=C awk "BEGIN {printf \"%.2f\",$more_options.sim/100}"`
+ #if $sim:
+ --sim=`LC_NUMERIC=C awk "BEGIN {printf \"%.2f\",$sim/100}"`
  #end if
  #if $more_options.identity:
  --cov=$more_options.identity
@@ -100,15 +115,16 @@
- - + +
+ - - + +
@@ -116,7 +132,7 @@
- +
@@ -124,75 +140,117 @@
- - + + - - - + + + + + + + + + + + + + + + + + + + + + + + - - + + + + + - - + + - - - - - +
+ + + + + +
+ + +
- - - - + + + + + + + + + - + - + + + - + - + + + - + - + + + - + - + + +
@@ -205,9 +263,9 @@
  Proteinortho is a tool to detect orthologous proteins/genes within different species (at least 2).
  | It compares similarities of given gene/protein sequences and clusters them to find significant groups.
- | The algorithm was designed to handle large-scale data and can be applied to hundreds of species at one.
+ | The algorithm was designed to handle large-scale data and can be applied to hundreds of species at once.
  | Details can be found in (doi:10.1186/1471-2105-12-124).
- | To enhance the prediction accuracy, the relative order of genes (synteny) can be used as additional feature for the discrimination of orthologs. The corresponding extension, namely PoFF (details see doi:10.1371/journal.pone.0105015), is already build in Proteinortho.
+ | To enhance the prediction accuracy, the relative order of genes (synteny) can be used as an additional feature for the discrimination of orthologs. The corresponding extension, namely PoFF (details see doi:10.1371/journal.pone.0105015), is already built in Proteinortho.
  ----
@@ -218,13 +276,13 @@
  * **(i) Build adaptive reciprocal best hit graph (RBH)**
  | Using the blast algorithm (diamond,blast,blat,...) all input sequences are compared against each other.
- | If two proteins find each other with respect to multiple criteria like minimal evalue, similarity compared to the best hit, ... then a edge is drawn between the two proteins.
+ | If two proteins find each other with respect to multiple criteria like minimal evalue, and similarity compared to the best hit, ... then an edge is drawn between the two proteins.
  | The result of this step is outputted to RBH
  * **(ii) Cluster the RBH**
  | Using two clustering algorithms, edges are removed that weakly connect two connected components to reduce false positive hits.
- | The resulting connected components are outputted in orthology-groups / -PAIRS
+ | The resulting connected components are outputted in orthology-groups / -pairs
  ----
@@ -235,35 +293,49 @@
  * **RBH**
  | The result of the (i) step, the reciprocal best hit graph.
- | First a comment line announces 2 species (# ecoli.faa human.faa), then each line corresponds to a reciprocal best hit between 2 proteins/genes of the announced species. The output format is shown below.
+ | First two comment line announces 2 species (# ecoli.faa human.faa) as well as the median values (evalue_ab,bitscore_ab,evalue_ba,bitscore_ba).
+ | Following these header lines, each line corresponds to a reciprocal best hit of 2 proteins/genes (columns 1 and 2) of the announced species. The output format is shown below.
  | *seqidA*,*seqidB* = the 2 ids/names of the proteins involved
  | *evalue_ab* = evalue with seqidA as query and seqidB as part of the database
  | *bitscore_ab* = bitscore with seqidA as query ...
  | *evalue_ba* = evalue with seqidB as query ...
- | ...
  .. csv-table::
  seqidA,seqidB,evalue_ab,bitscore_ab,evalue_ba,bitscore_ba
+ # ecoli.faa,human.faa
+ # 1.91e-112,357.5,1.825e-113,360
+ L_10,C_10;test,4.32e-151,447,4.30e-151,446
+ L_11,C_11,1.17e-68,209,3.00e-69,210
+ L_14,C_14,3.64e-139,422,1.19e-142,431
+ L_15,C_15,3.51e-100,303,2.12e-102,308
+ L_16,C_16,3.75e-49,157,7.06e-50,159
+ L_17,C_17,2.96e-195,578,5.50e-196,579
  ----
  * **orthology-groups**
  | The result of the (ii) step, the clustered reciprocal best hit graph or the orthology groups.
- | Every line corresponds to an orthology group of proteins/genes.
- | The first 3 columns characterize general properties of that group: number of proteins, species and the algebraic connectivity. The higher the algebraic connectivity the more edges are there and the better the group is connected to itself in general.
- | Then a column for each species follows containing the proteins of that species. If a species contributes with more than one protein to a group of orthologs, then they are ordered by connectivity.
+ | Every line corresponds to an orthology group.
+ | The first 3 columns characterize the general properties of that group: number of proteins, species, and algebraic connectivity. The higher the algebraic connectivity the more edges are there and the better the group is connected to itself in general.
+ | Then a column for each species follows containing the proteins of these species.
+ | If a species contributes with more than one protein to a group of orthologs, then they are ordered by descending connectivity.
+ | The '*' represents that this species does not contribute to the group.
  .. csv-table::
- Species,Genes,Alg.-Conn.
+ Species,Genes,alg.-conn.,ecoli.faa,human.faa,snail.faa,wale.faa,ebola.faa
+ 5,5,0.715,C_10,C_10;test,E_10,L_10,M_10
+ 4,6,0.115,*,C_12,E_315,L_313,M_313
+ 4,5,0.167,*,C_63,E_19,L_19,M_19
+ 4,4,0.816,*,C_64,E_18,L_18,M_18
  ----
  * **orthology-pairs**
- | The same as orthology-groups but every edge is printed one-by-one here. The output is formatted the same as the RBH graph:
+ | The same as orthology-groups but every edge is printed one-by-one instead of the whole group. The output is formatted the same as the RBH graph:
  .. csv-table::
@@ -273,11 +345,17 @@
 **Proteinortho-Tools for downstream analysis**
-* `proteinortho grab proteins` : find gene(s)/protein(s) in a given fasta file and retrieve their sequence(s). You can also use a orthology-groups file.
+* `proteinortho grab proteins` : find gene(s)/protein(s) in a given fasta file and retrieve their sequence(s). You can also use a orthology-groups file or a subset (e.g. filter by Species>10).
 * `proteinortho summary` : Summaries the orthology-pairs/RBH files to determine how the species are connected to each other.
 More information can be found on github https://gitlab.com/paulklemm_PHD/proteinortho
+
+**Citations:**
+
+- Lechner, Marcus, et al. "Proteinortho: detection of (co-) orthologs in large-scale analysis." BMC bioinformatics 12.1 (2011): 1-9. (10.1186/1471-2105-12-124)
+- Lechner, Marcus, et al. "Orthology detection combining clustering and synteny for very large datasets." PLoS one 9.8 (2014): e105015. (10.1371/journal.pone.0105015)
+
 ]]>
-
+
diff -r 85c411546123 -r 5532c0e5d4a6 proteinortho_macros.xml
--- a/proteinortho_macros.xml Tue Nov 22 16:49:50 2022 +0000
+++ b/proteinortho_macros.xml Fri Jun 16 20:52:41 2023 +0000
@@ -1,6 +1,6 @@
- 6.1.2
+ 6.2.3
  1
  20.09
@@ -12,12 +12,10 @@
  proteinortho
-
- diamond
+ diamond
  blast
  ucsc-blat
- last
+ last