# HG changeset patch # User blankenberg # Date 1519851297 18000 # Node ID 5c852eca82e036e32cece1325d7ea2c0e68b1dcc # Parent cfc86c3fc5c8bc4b4607270b21129c58ba4fcde4 planemo upload for repository https://github.com/blankenberg/tools-blankenberg/tree/master/tools/naive_variant_caller commit a1f39a3e28911591f6a1ed58a43e95e0baf5e750 diff -r cfc86c3fc5c8 -r 5c852eca82e0 README.rst --- a/README.rst Fri Feb 17 11:42:07 2017 -0500 +++ b/README.rst Wed Feb 28 15:54:57 2018 -0500 @@ -1,4 +1,4 @@ -This repository contains the **Naive Variant Caller** tool. +This repository contains the **Naive Variant Caller** tool (NVC). ------ diff -r cfc86c3fc5c8 -r 5c852eca82e0 dependency_configs/tool_dependencies.xml --- a/dependency_configs/tool_dependencies.xml Fri Feb 17 11:42:07 2017 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,12 +0,0 @@ - - - - - - - - - - - - diff -r cfc86c3fc5c8 -r 5c852eca82e0 naive_variant_caller.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/naive_variant_caller.xml Wed Feb 28 15:54:57 2018 -0500 @@ -0,0 +1,232 @@ + + - tabulate variable sites from BAM datasets + + nvc + + + + + + naive_variant_caller.py --version + naive_variant_caller.py + -o "${output_vcf}" + + #for $input_bam in $reference_source.input_bams: + -b '${input_bam.input_bam}' + -i '${input_bam.input_bam.metadata.bam_index}' + #end for + + #if $reference_source.reference_source_selector != "history": + -r '${reference_source.ref_file.fields.path}' + #elif $reference_source.ref_file: + -r '${reference_source.ref_file}' + #end if + + #for $region in $regions: + --region '${region.chromosome}:${region.start}-${region.end}' + #end for + + #for $region_file in $region_files: + --regions_filename '${region_file.input_region}' + --regions_file_columns '${int($region_file.input_region.metadata.chromCol)-1},${int($region_file.input_region.metadata.startCol)-1},${int($region_file.input_region.metadata.endCol)-1}' + #end for + + ${variants_only} + + ${use_strand} + + --ploidy 
'${$ploidy}' + + --min_support_depth '${min_support_depth}' + + #if str($min_base_quality): + --min_base_quality '${min_base_quality}' + #end if + + #if str($min_mapping_quality): + --min_mapping_quality '${min_mapping_quality}' + #end if + + --allow_out_of_bounds_positions + + #if str( $advanced_options.advanced_options_selector ) == "advanced": + #if str( $advanced_options.coverage_dtype ) != "guess": + --coverage_dtype '${advanced_options.coverage_dtype}' + #end if + ${advanced_options.safe} + #end if + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +**What it does** + +This tool is a naive variant caller that processes aligned sequencing reads from the BAM format and produces a VCF file containing per position variant calls. This tool allows multiple BAM files to be provided as input and utilizes read group information to make calls for individual samples. + +User configurable options allow filtering reads that do not pass mapping or base quality thresholds and minimum per base read depth; user's can also specify the ploidy and whether to consider each strand separately. + +In addition to calling alternate alleles based upon simple ratios of nucleotides at a position, per base nucleotide counts are also provided. A custom tag, NC, is used within the Genotype fields. The NC field is a comma-separated listing of nucleotide counts in the form of <nucleotide>=<count>, where a plus or minus character is prepended to indicate strand, if the strandedness option was specified. + + +------ + +**Inputs** + +Accepts one or more BAM input files and a reference genome from the built-in list or from a FASTA file in your history. + + +**Outputs** + +The output is in VCF format. + +Example VCF output line, without reporting by strand: + ``chrM 16029 . T G,A,C . . 
AC=15,9,5;AF=0.00155311658729,0.000931869952371,0.000517705529095 GT:AC:AF:NC 0/0:15,9,5:0.00155311658729,0.000931869952371,0.000517705529095:A=9,C=5,T=9629,G=15,`` + +Example VCF output line, when reporting by strand: + ``chrM 16029 . T G,A,C . . AC=15,9,5;AF=0.00155311658729,0.000931869952371,0.000517705529095 GT:AC:AF:NC 0/0:15,9,5:0.00155311658729,0.000931869952371,0.000517705529095:+T=3972,-A=9,-C=5,-T=5657,-G=15,`` + +**Options** + +Reference Genome: + + Ensure that you have selected the correct reference genome, either from the list of built-in genomes or by selecting the corresponding FASTA file from your history. + +Restrict to regions: + + You can specify any number of regions on which you would like to receive results. You can specify just a chromosome name, or a chromosome name and start position, or a chromosome name and start and end position for the set of desired regions. + +Minimum number of reads needed to consider a REF/ALT: + + This value declares the minimum number of reads containing a particular base at each position in order to list and use said allele in genotyping calls. Default is 0. + +Minimum base quality: + + The minimum base quality score needed for the position in a read to be used for nucleotide counts and genotyping. Default is no filter. + +Minimum mapping quality: + + The minimum mapping quality score needed to consider a read for nucleotide counts and genotyping. Default is no filter. + +Ploidy: + + The number of genotype calls to make at each reported position. + +Only write out positions with possible alternate alleles: + + When set, only positions which have at least one non-reference nucleotide that passes the declared filters will be present in the output. + +Report counts by strand: + + When set, nucleotide counts (NC) will be reported in reference to the aligned read's source strand. Reported as: <strand><BASE>=<COUNT>. 
+ +Choose the dtype to use for storing coverage information: + + This controls the maximum depth value for each nucleotide/position/strand (when specified). Smaller values require the least amount of memory, but have smaller maximal limits. + + +--------+----------------------------+ + | name | maximum coverage value | + +========+============================+ + | uint8 | 255 | + +--------+----------------------------+ + | uint16 | 65,535 | + +--------+----------------------------+ + | uint32 | 4,294,967,295 | + +--------+----------------------------+ + | uint64 | 18,446,744,073,709,551,615 | + +--------+----------------------------+ + + + + + 10.1186/gb4161 + + + diff -r cfc86c3fc5c8 -r 5c852eca82e0 tool-data/tool_data_table_conf.xml.sample --- a/tool-data/tool_data_table_conf.xml.sample Fri Feb 17 11:42:07 2017 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,7 +0,0 @@ - - - - line_type, value, path - -

- diff -r cfc86c3fc5c8 -r 5c852eca82e0 tool_data_table_conf.xml.sample --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool_data_table_conf.xml.sample Wed Feb 28 15:54:57 2018 -0500 @@ -0,0 +1,7 @@ + + + + line_type, value, path + +

+ diff -r cfc86c3fc5c8 -r 5c852eca82e0 tools/naive_variant_caller.py --- a/tools/naive_variant_caller.py Fri Feb 17 11:42:07 2017 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,72 +0,0 @@ -#!/usr/bin/env python -#Dan Blankenberg -import sys -import optparse - -from pyBamParser.bam import Reader -from pyBamTools.genotyping.naive import VCFReadGroupGenotyper, PROGRAM_NAME, PROGRAM_VERSION - -def main(): - #Parse Command Line - parser = optparse.OptionParser() - parser.add_option( '-b', '--bam', dest='bam_file', action='append', type="string", default=[], help='BAM filename, optionally index filename. Multiple allowed.' ) - parser.add_option( '-i', '--index', dest='index_file', action='append', type="string", default=[], help='optionally index filename. Multiple allowed.' ) - parser.add_option( '-o', '--output_vcf_filename', dest='output_vcf_filename', action='store', default = None, type="string", help='Output VCF filename' ) - parser.add_option( '-r', '--reference_genome_filename', dest='reference_genome_filename', action='store', default = None, type="string", help='Input reference file' ) - parser.add_option( '-v', '--variants_only', dest='variants_only', action='store_true', default = False, help='Report only sites with a possible variant allele.' ) - parser.add_option( '-s', '--use_strand', dest='use_strand', action='store_true', default = False, help='Report counts by strand' ) - parser.add_option( '-p', '--ploidy', dest='ploidy', action='store', type="int", default=2, help='Ploidy. Default=2.' ) - parser.add_option( '-d', '--min_support_depth', dest='min_support_depth', action='store', type="int", default=0, help='Minimum number of reads needed to consider a REF/ALT. Default=0.' ) - parser.add_option( '-q', '--min_base_quality', dest='min_base_quality', action='store', type="int", default=None, help='Minimum base quality.' 
) - parser.add_option( '-m', '--min_mapping_quality', dest='min_mapping_quality', action='store', type="int", default=None, help='Minimum mapping.' ) - parser.add_option( '-t', '--coverage_dtype', dest='coverage_dtype', action='store', type="string", default=None, help='dtype to use for coverage array' ) - parser.add_option( '--allow_out_of_bounds_positions', dest='allow_out_of_bounds_positions', action='store_true', default = False, help='Allows out of bounds positions to not throw fatal errors' ) - parser.add_option( '--safe', dest='safe', action='store_true', default = False, help='Perform checks to prevent certain errors. Is slower.' ) - parser.add_option( '--region', dest='region', action='append', type="string", default=[], help='region' ) - parser.add_option( '', '--version', dest='version', action='store_true', default = False, help='Report version and quit' ) - (options, args) = parser.parse_args() - - if options.version: - print "%s version %s" % ( PROGRAM_NAME, PROGRAM_VERSION ) - sys.exit(0) - - if len( options.bam_file ) == 0: - print >>sys.stderr, 'You must provide at least one bam (-b) file.' - parser.print_help( sys.stderr ) - sys.exit( 1 ) - if options.index_file: - assert len( options.index_file ) == len( options.bam_file ), "If you provide a name for an index file, you must provide the index name for all bam files." - bam_files = zip( options.bam_file, options.index_file ) - else: - bam_files = [ ( x, ) for x in options.bam_file ] - if not options.reference_genome_filename: - print >> sys.stderr, "Warning: Reference file has not been specified. Providing a reference genome is highly recommended." 
- if options.output_vcf_filename: - out = open( options.output_vcf_filename, 'wb' ) - else: - out = sys.stdout - - regions = [] - if options.region: - for region in options.region: - region_split = region.split( ":" ) - region = region_split.pop( 0 ) - if region_split: - region_split = filter( bool, region_split[0].split( '-' ) ) - if region_split: - if len( region_split ) != 2: - print >> sys.stderr, "You must specify both a start and an end, or only a chromosome when specifying regions." - cleanup_before_exit( tmp_dir ) - sys.exit( 1 ) - region = tuple( [ region ] + map( int, region_split ) ) - regions.append( region ) - - coverage = VCFReadGroupGenotyper( map( lambda x: Reader( *x ), bam_files ), options.reference_genome_filename, dtype=options.coverage_dtype, - min_support_depth=options.min_support_depth, min_base_quality=options.min_base_quality, - min_mapping_quality=options.min_mapping_quality, restrict_regions=regions, use_strand=options.use_strand, - allow_out_of_bounds_positions=options.allow_out_of_bounds_positions, safe=options.safe ) - for line in coverage.iter_vcf( ploidy=options.ploidy, variants_only=options.variants_only ): - out.write( "%s\n" % line ) - out.close() - -if __name__ == "__main__": main() diff -r cfc86c3fc5c8 -r 5c852eca82e0 tools/naive_variant_caller.xml --- a/tools/naive_variant_caller.xml Fri Feb 17 11:42:07 2017 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,226 +0,0 @@ - - - tabulate variable sites from BAM datasets - - numpy - pyBamParser - pyBamTools - - - - - - naive_variant_caller.py - -o "${output_vcf}" - - #for $input_bam in $reference_source.input_bams: - -b "${input_bam.input_bam}" - -i "${input_bam.input_bam.metadata.bam_index}" - #end for - - #if $reference_source.reference_source_selector != "history": - -r "${reference_source.ref_file.fields.path}" - #elif $reference_source.ref_file: - -r "${reference_source.ref_file}" - #end if - - #for $region in $regions: - --region 
"${region.chromosome}:${region.start}-${region.end}" - #end for - - ${variants_only} - - ${use_strand} - - --ploidy "${$ploidy}" - - --min_support_depth "${min_support_depth}" - - #if str($min_base_quality): - --min_base_quality "${min_base_quality}" - #end if - - #if str($min_mapping_quality): - --min_mapping_quality "${min_mapping_quality}" - #end if - - --allow_out_of_bounds_positions - - #if str( $advanced_options.advanced_options_selector ) == "advanced": - #if str( $advanced_options.coverage_dtype ) != "guess": - --coverage_dtype "${advanced_options.coverage_dtype}" - #end if - ${advanced_options.safe} - #end if - - naive_variant_caller.py --version - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -**What it does** - -This tool is a naive variant caller that processes aligned sequencing reads from the BAM format and produces a VCF file containing per position variant calls. This tool allows multiple BAM files to be provided as input and utilizes read group information to make calls for individual samples. - -User configurable options allow filtering reads that do not pass mapping or base quality thresholds and minimum per base read depth; user's can also specify the ploidy and whether to consider each strand separately. - -In addition to calling alternate alleles based upon simple ratios of nucleotides at a position, per base nucleotide counts are also provided. A custom tag, NC, is used within the Genotype fields. The NC field is a comma-separated listing of nucleotide counts in the form of <nucleotide>=<count>, where a plus or minus character is prepended to indicate strand, if the strandedness option was specified. - - ------- - -**Inputs** - -Accepts one or more BAM input files and a reference genome from the built-in list or from a FASTA file in your history. - - -**Outputs** - -The output is in VCF format. 
- -Example VCF output line, without reporting by strand: - ``chrM 16029 . T G,A,C . . AC=15,9,5;AF=0.00155311658729,0.000931869952371,0.000517705529095 GT:AC:AF:NC 0/0:15,9,5:0.00155311658729,0.000931869952371,0.000517705529095:A=9,C=5,T=9629,G=15,`` - -Example VCF output line, when reporting by strand: - ``chrM 16029 . T G,A,C . . AC=15,9,5;AF=0.00155311658729,0.000931869952371,0.000517705529095 GT:AC:AF:NC 0/0:15,9,5:0.00155311658729,0.000931869952371,0.000517705529095:+T=3972,-A=9,-C=5,-T=5657,-G=15,`` - -**Options** - -Reference Genome: - - Ensure that you have selected the correct reference genome, either from the list of built-in genomes or by selecting the corresponding FASTA file from your history. - -Restrict to regions: - - You can specify any number of regions on which you would like to receive results. You can specify just a chromosome name, or a chromosome name and start position, or a chromosome name and start and end position for the set of desired regions. - -Minimum number of reads needed to consider a REF/ALT: - - This value declares the minimum number of reads containing a particular base at each position in order to list and use said allele in genotyping calls. Default is 0. - -Minimum base quality: - - The minimum base quality score needed for the position in a read to be used for nucleotide counts and genotyping. Default is no filter. - -Minimum mapping quality: - - The minimum mapping quality score needed to consider a read for nucleotide counts and genotyping. Default is no filter. - -Ploidy: - - The number of genotype calls to make at each reported position. - -Only write out positions with possible alternate alleles: - - When set, only positions which have at least one non-reference nucleotide that passes the declared filters will be present in the output. - -Report counts by strand: - - When set, nucleotide counts (NC) will be reported in reference to the aligned read's source strand. Reported as: <strand><BASE>=<COUNT>. 
- -Choose the dtype to use for storing coverage information: - - This controls the maximum depth value for each nucleotide/position/strand (when specified). Smaller values require the least amount of memory, but have smaller maximal limits. - - +--------+----------------------------+ - | name | maximum coverage value | - +========+============================+ - | uint8 | 255 | - +--------+----------------------------+ - | uint16 | 65,535 | - +--------+----------------------------+ - | uint32 | 4,294,967,295 | - +--------+----------------------------+ - | uint64 | 18,446,744,073,709,551,615 | - +--------+----------------------------+ - - - - - - - - - - - - - - - - - - - - - - - 10.1186/gb4161 - - -
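Reviewer note, not part of the changeset: the help text describes the custom NC tag as a comma-separated list of `<nucleotide>=<count>` pairs, optionally prefixed with `+` or `-` for strand. The following is a minimal Python 3 sketch of how a downstream script might read that field, using the two example values from the help text. The helper names `parse_nc` and `collapse_strands` are hypothetical and do not exist in the tool or in pyBamTools. The final comparison only observes that the first AF value in the example line is consistent with count divided by total depth; the changeset does not state how AF is defined.

```python
def parse_nc(nc_field):
    """Parse an NC value such as 'A=9,C=5,T=9629,G=15,' into a dict of counts.

    Keys keep any leading '+' or '-' strand prefix exactly as written.
    """
    counts = {}
    for item in nc_field.split(","):
        if not item:  # the example values end with a trailing comma
            continue
        key, _, value = item.partition("=")
        counts[key] = int(value)
    return counts


def collapse_strands(counts):
    """Sum stranded counts ('+T', '-T') into per-base totals ('T')."""
    merged = {}
    for key, value in counts.items():
        base = key.lstrip("+-")
        merged[base] = merged.get(base, 0) + value
    return merged


# NC values copied from the two example VCF lines in the help text.
unstranded = parse_nc("A=9,C=5,T=9629,G=15,")
stranded = parse_nc("+T=3972,-A=9,-C=5,-T=5657,-G=15,")

# Total depth at the site and the fraction of reads supporting G.
depth = sum(unstranded.values())
g_fraction = unstranded["G"] / depth

# In the example, the stranded counts add up to the unstranded ones.
print(collapse_strands(stranded) == unstranded)
# The first AF value in the example line is consistent with count / depth.
print(abs(g_fraction - 0.00155311658729) < 1e-9)
```

Keeping the strand prefix in the key until `collapse_strands` is called lets the same parser serve both output modes of the tool.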