# HG changeset patch # User iuc # Date 1606684496 0 # Node ID 2e497a770bca73be42c9fb8aa6b41a45137b8c62 # Parent 2b3e65a4252f20a52d8ff2457006c552c7c30f54 "planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tool_collections/snpsift/snpsift commit 200c7d062259a94a28c6a224586f59d1a5e08309" diff -r 2b3e65a4252f -r 2e497a770bca snpSift_filter.xml --- a/snpSift_filter.xml Wed Jan 16 15:35:26 2019 -0500 +++ b/snpSift_filter.xml Sun Nov 29 21:14:56 2020 +0000 @@ -1,4 +1,4 @@ - + Filter variants using arbitrary expressions snpSift_macros.xml @@ -8,54 +8,75 @@ '$output' ]]> -$expr#slurp +$filter_expression.expr#slurp - - - - - + + + + + + + + + + + + + + + + + + + - - + + + + - - - - - - - - - - - + + + + + + + + + - @@ -71,7 +92,6 @@ - @@ -83,7 +103,6 @@ - @@ -98,7 +117,6 @@ - @@ -111,11 +129,31 @@ + + + + + + + + + + + + + + + + + + + + 30) | (exists INDEL) | ( countHet() > 2 )". The actual expressions can be quite complex, so it allows for a lot of flexibility. +This tool provides a flexible solution for filtering the variants in a VCF input dataset through the use of arbitrary, possibly rather complex expressions. Some examples: @@ -127,43 +165,62 @@ (FILTER = 'PASS') | ( na FILTER ) +- *Variants that have either a QUAL score above 30, or are indel variants, or for which at least two samples have a heterozygous genotype called*:: + + (QUAL > 30) | (exists INDEL) | ( countHet() > 2 ) + +- *Variants that are supported by at least 10 reads (as calculated from the DP4 attribute in the INFO field through zero-based index-access to the multiple values)*:: + + (DP4[2] + DP4[3] >= 10) + +---- + +Sets: + +The tool can construct sets of values for use in expressions from text files listing one value per line. Variants can then be filtered based on whether a given field in the variant record has a value that's contained in a set. 
For example, the expression::
+
+    ( ID in SET[2] )
+
+would filter variants based on whether their ID field value appears in the set parsed from the third dataset used for set construction (the first set can be addressed with index ``[0]``, the second with index ``[1]``, and so on).
+
+----
+
+Genotype-based filtering:
+
+Genotypes of specific samples can be accessed via zero-based indexing or via sample names.
+
+- *I want to keep samples where the genotype for the first sample is homozygous variant and the genotype for the second sample is reference*::
+
+    (isHom( GEN[0] ) & isVariant( GEN[0] ) & isRef( GEN[1] ))
+
+----
+
+Filtering based on SnpEff annotations (``ANN`` or ``EFF`` fields):
+
 - *I want to filter lines with an ANN annotation EFFECT of 'frameshift_variant' ( for vcf files using Sequence Ontology terms )*::
 
     ( ANN[*].EFFECT has 'frameshift_variant' )
 
-  **Important** According to the specification, there can be more than one EFFECT separated by & (e.g. 'missense_variant&splice_region_variant', thus using has operator is better than using equality operator (=). For instance 'missense_variant&splice_region_variant' = 'missense_variant' is false, whereas 'missense_variant&splice_region_variant' has 'missense_variant' is true.
+  .. class:: infomark
+
+  According to the specification, there can be more than one EFFECT separated by ``&`` (e.g. ``'missense_variant&splice_region_variant'``), thus using the ``has`` operator is better than using the equality operator (``=``). For instance, ``'missense_variant&splice_region_variant' = 'missense_variant'`` is false, whereas ``'missense_variant&splice_region_variant' has 'missense_variant'`` is true.
 
 - *I want to filter lines with an EFF of 'FRAME_SHIFT' ( for vcf files using Classic Effect names )*::
 
     ( EFF[*].EFFECT = 'FRAME_SHIFT' )
 
-- *I want to filter out samples with quality less than 30*::
 
-    ( QUAL > 30 )
-
-- *...but we also want InDels that have quality 20 or more*::
-
-    (( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )
-
-- *...or any homozygous variant present in more than 3 samples*::
-
-    (countHom() > 3) | (( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )
+.. class:: infomark
 
-- *...or any heterozygous sample with coverage 25 or more*::
-
-    ((countHet() > 0) & (DP >= 25)) | (countHom() > 3) | (( exists INDEL ) & (QUAL >= 20)) | (QUAL >= 30 )
-
-- *I want to keep samples where the genotype for the first sample is homozygous variant and the genotype for the second sample is reference*::
+For information regarding HGVS and Sequence Ontology terms versus classic names:
 
-    (isHom( GEN[0] ) & isVariant( GEN[0] ) & isRef( GEN[1] ))
-
-**For information regarding HGVS and Sequence Ontology terms versus classic names**:
-
-- http://snpeff.sourceforge.net/SnpEff_manual.html#cmdline for the options: -classic, -hgvs, and -sequenceOntology
-- http://snpeff.sourceforge.net/SnpEff_manual.html#input for the table containing the classic name and sequence onology term for each effect
+- https://pcingola.github.io/SnpEff/se_commandline/ for the options: ``-classic``, ``-hgvs``, and ``-sequenceOntology``
+- https://pcingola.github.io/SnpEff/se_inputoutput/#effect-prediction-details for the table containing the classic name and sequence onology term for each effect
 
 @EXTERNAL_DOCUMENTATION@
 
-- http://snpeff.sourceforge.net/SnpSift.html#filter
+- https://pcingola.github.io/SnpEff/ss_filter/
+
+The second link in particular has further details and more examples about the tool's expression syntax.
 ]]>
diff -r 2b3e65a4252f -r 2e497a770bca test-data/test_set1.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/test_set1.txt	Sun Nov 29 21:14:56 2020 +0000
@@ -0,0 +1,3 @@
+7268
+7283
+7335
diff -r 2b3e65a4252f -r 2e497a770bca test-data/test_set2.txt
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/test_set2.txt	Sun Nov 29 21:14:56 2020 +0000
@@ -0,0 +1,2 @@
+12474
+12483