comparison clipkit_repo/docs/performance_assessment/index.rst @ 0:49b058e85902 draft
"planemo upload for repository https://github.com/jlsteenwyk/clipkit commit cbe1e8577ecb1a46709034a40dff36052e876e7a-dirty"
| author | padge |
|---|---|
| date | Fri, 25 Mar 2022 13:04:31 +0000 |
| parents | |
| children |
comparison
| -1:000000000000 | 0:49b058e85902 |
|---|---|
| 1 .. _performance: | |
| 2 | |
| 3 | |
| 4 Performance Assessment | |
| 5 ====================== | |
| 6 | |
| 7 | | |
| 8 | |
| 9 Benchmarking | |
| 10 ------------ | |
| 11 | |
| 12 ^^^^^ | |
| 13 | |
| 14 In brief, performance assessment and comparison of multiple alignment trimming software | |
| 15 revealed that ClipKIT is a top-performing software. | |
| 16 | |
| 17 .. image:: ../_static/img/Performance_summary.jpg | |
| 18 | |
| 19 **ClipKIT is a top-performing software for trimming multiple sequence alignments.** | |
| 20 Across a total of 138,152 multiple sequence alignments (MSAs) from empirical (left) and | |
| 21 simulated (right) datasets, desirability-based integration of accuracy and support metrics | |
| 22 per MSA facilitated the comparison of relative software performance and revealed that ClipKIT | |
| 23 is a top-performing software. MSA trimming approaches are ordered along the x-axis from | |
| 24 the highest-performing software (left) to the lowest-performing software (right) according to average | |
| 25 desirability-based rank, which is derived from measures of tree accuracy (i.e., normalized Robinson-Foulds | |
| 26 distance) and tree support (i.e., average bipartition support). | |
| 27 | |
| 28 Abbreviations of trimmers and parameters are as follows: | |
| 29 ClipKIT: g = gappy mode; ClipKIT: kc = kpic mode; ClipKIT: kcg = kpic-gappy mode; ClipKIT: k = kpi mode; | |
| 30 ClipKIT: kg = kpi-gappy mode; BMGE = BMGE default; BMGE 0.3 = 0.3 entropy threshold; | |
| 31 BMGE 0.7 = 0.7 entropy threshold; trimAl: s = strict; trimAl: sp = strictplus; Noisy = default; | |
| 32 Gblocks = default; No trim = no trimming. | |
| 33 | |
| 34 For additional details about performance assessment, please see *ClipKIT: a multiple sequence | |
| 35 alignment trimming software for accurate phylogenomic inference*. Steenwyk et al. PLoS Biology. doi: |doiLink|_. | |
| 36 | |
| 37 .. _doiLink: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001007 | |
| 38 .. |doiLink| replace:: 10.1371/journal.pbio.3001007 | |
| 39 | |
| 40 | | |
| 41 | |
| 42 smart-gap | |
| 43 --------- | |
| 44 | |
| 45 ^^^^^ | |
| 46 | |
| 47 Starting with version 1.1.0, a dynamic gappyness threshold determination approach (referred to | |
| 48 as smart-gap) has been introduced into ClipKIT and is now the default trimming approach. The | |
| 49 motivation for smart-gap stems from excessive trimming among highly divergent sequences. | |
| 50 | |
| 51 .. image:: ../_static/img/smart_gaps_trimming.png | |
| 52 | |
| 53 For example, in the figure above, we simulated 100 sequences for various trees with 100 tips. | |
| 54 Each tree had a different total tree length, a measure of total evolutionary divergence (x-axis). | |
| 55 Differences in total tree length were generated by multiplying the branch lengths of the starting | |
| 56 random tree (generated using IQ-TREE 2) by a factor ranging from 0.25 to 10. Thus, the same tree | |
| 57 shape and relative branch lengths were used during the simulations. Simulations were generated using | |
| 58 INDELible. Examining the percentage of the alignment remaining after trimming revealed that using a strict | |
| 59 gappyness threshold of 90% resulted in 'extreme' trimming, which is not recommended (|TanLink|_). | |
| 60 In contrast, smart-gap retains a large fraction of the alignment and only removes the most | |
| 61 gappy sites. Thus, smart-gap is a better approach for sequence alignments that span deep and | |
| 62 shallow evolutionary timescales. | |
| 63 | |
| 64 More specifically, when implementing the smart-gap approach, ClipKIT first examines the | |
| 65 distribution of gaps across the alignment. Next, ClipKIT determines the gap-to-gap slope | |
| 66 between adjacent gappyness bins. By examining the maximum difference in slope between | |
| 67 adjacent bins, ClipKIT determines which step would correspond to removing a large number | |
| 68 of sites in comparison to other steps. Of note, ClipKIT only examines the first half of | |
| 69 the calculated slopes so as not to trim too much of the alignment. ClipKIT will then choose | |
| 70 the threshold that ensures this large number of sites will not be trimmed. | |
| 71 | |
| 72 .. _TanLink: https://academic.oup.com/sysbio/article/64/5/778/1685763 | |
| 73 .. |TanLink| replace:: Tan *et al.* (2015) | |
| 74 | |
| 75 For example, in the following test alignment: | |
| 76 | |
| 77 .. code-block:: text | |
| 78 | |
| 79 >1 | |
| 80 A-GTAT- | |
| 81 >2 | |
| 82 A-G-AT- | |
| 83 >3 | |
| 84 A-G-TA- | |
| 85 >4 | |
| 86 AGA-TA- | |
| 87 >5 | |
| 88 ACa-T-G | |
| 89 | |
| 90 there are two sites with four gaps, one site with three gaps, and one | |
| 91 site with one gap. ClipKIT will calculate the slope between two points: | |
| 92 removing sites with greater than or equal to 80% gaps (2/7ths of the | |
| 93 alignment) and removing sites with greater than or equal to 60% gaps | |
| 94 (3/7ths of the alignment). Next, ClipKIT will determine the slope between | |
| 95 removing sites with greater than or equal to 60% gaps (3/7ths of the | |
| 96 alignment) and removing sites with greater than or equal to 20% gaps | |
| 97 (4/7ths of the alignment), and so on. ClipKIT will then examine | |
| 98 the first half of slope values and use the less strict gaps threshold | |
| 99 from the two points that generated the greatest difference between | |
| 100 consecutive slopes. | |
| 101 | |
| 102 |
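
The counts in the worked example above can be checked with a short Python sketch. This is not ClipKIT's own code: the function names (`gap_fraction_per_site`, `threshold_points`, `slopes_between`) are hypothetical, the slope is assumed to be the change in the fraction of sites removed divided by the change in the gap threshold, and sites without gaps are left out for brevity. The final step of selecting a threshold from the slopes is not reproduced here.

```python
from fractions import Fraction

# Test alignment from the example above (record name -> aligned sequence).
ALIGNMENT = {
    "1": "A-GTAT-",
    "2": "A-G-AT-",
    "3": "A-G-TA-",
    "4": "AGA-TA-",
    "5": "ACa-T-G",
}


def gap_fraction_per_site(seqs, gap_chars="-"):
    """Return the fraction of gap characters in each alignment column."""
    n_seqs = len(seqs)
    return [
        Fraction(sum(ch in gap_chars for ch in column), n_seqs)
        for column in zip(*seqs)
    ]


def threshold_points(site_gap_fractions):
    """Pair each distinct non-zero gap fraction t (strictest first) with
    the fraction of sites whose gap fraction is >= t."""
    n_sites = len(site_gap_fractions)
    thresholds = sorted({f for f in site_gap_fractions if f > 0}, reverse=True)
    return [
        (t, Fraction(sum(f >= t for f in site_gap_fractions), n_sites))
        for t in thresholds
    ]


def slopes_between(points):
    """Absolute slope between each pair of adjacent (threshold, removed) points.

    Assumption: slope = change in fraction removed / change in threshold.
    """
    return [
        abs((y1 - y0) / (x1 - x0))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


site_fractions = gap_fraction_per_site(list(ALIGNMENT.values()))
points = threshold_points(site_fractions)
slopes = slopes_between(points)

for threshold, removed in points:
    print(f">= {float(threshold):.0%} gaps removes {removed} of the alignment")
print("slopes:", [str(s) for s in slopes])
```

Run on the test alignment, the sketch reports that thresholds of 80%, 60%, and 20% remove 2/7, 3/7, and 4/7 of the alignment, matching the counts described above. Exact fractions are used instead of floats so that the values can be compared directly with the "7ths" in the text.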
