<tool id="map_damage" name="mapDamage" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="23.2">
    <description>Tracks and quantifies damage patterns in ancient DNA sequences</description>
    <macros>
        <token name="@TOOL_VERSION@">2.2.3</token>
        <token name="@VERSION_SUFFIX@">0</token>
    </macros>
    <xrefs>
        <xref type="bio.tools">mapdamage</xref>
    </xrefs>
    <requirements>
        <requirement type="package" version="@TOOL_VERSION@">mapdamage2</requirement>
    </requirements>
    <version_command><![CDATA[mapDamage --version]]></version_command>
    <command detect_errors="aggressive"><![CDATA[

        ln -s '$sbam_file' alignment.$sbam_file.ext &&
        #if $reference.ref_source == 'history':
            ln -s '$reference.history_reference' reference.fa &&
        #end if
    ## MAIN COMMAND LINE

        mapDamage

            ## INPUT / OUTPUT PARAMETERS

            --input alignment.$sbam_file.ext

            #if $reference.ref_source == 'builtin':
                --reference '$reference.builtin_reference.fields.path'
            #elif $reference.ref_source == 'history':
                --reference reference.fa
            #end if

            --folder 'mapDamage_results'

            ## GENERAL OPTIONS

            $merge_reference_sequences

            #if $downsampling.downsampling.downsampling_type != 'no_downsampling':
                --downsample $downsampling.downsampling.downsample
                #if str($downsampling.downsampling.downsample_seed) != '':
                    --downsample-seed $downsampling.downsampling.downsample_seed
                #end if
            #end if

            --length $window_size.length
            --around $window_size.around
            --min-basequal $min_basequal
            $fasta

            ## GRAPHICS OPTIONS

            --ymax $graphics.ymax
            --readplot $graphics.readplot
            --refplot $graphics.refplot

            #if str($graphics.title):
                --title '$graphics.title'
            #end if

            $graphics.theme_bw

            ## STATISTICS OPTIONS

            #if $statistics.stats.no_stats:
                $statistics.stats.no_stats
            #else:
                --rand $statistics.stats.rand
                --burn $statistics.stats.burn
                --adjust $statistics.stats.adjust
                --iter $statistics.stats.iter
                $statistics.stats.ends.ends_to_process

                #if not $statistics.stats.ends.ends_to_process:
                    $statistics.stats.ends.diff_hangs
                #end if

                --seq-length $statistics.stats.seq_length
                $statistics.stats.var_disp
                $statistics.stats.mutation_model
                $statistics.stats.nick_vector
            #end if

            ## RESCALING OPTIONS

            #if $rescale.rescaling.rescale:
                --rescale
                #if str($rescale.rescaling.rescale_length_5p) != '':
                    --rescale-length-5p $rescale.rescaling.rescale_length_5p
                #end if
                #if str($rescale.rescaling.rescale_length_3p) != '':
                    --rescale-length-3p $rescale.rescaling.rescale_length_3p
                #end if
            #end if

    ]]></command>

    <inputs>

        <!--INPUT FILES-->
        <param name="sbam_file" argument="--input" type="data" format="sam,bam" label="SAM / BAM input file of aligned reads to be analyzed." help="Must contain a valid header."/>

        <conditional name="reference">
            <param name="ref_source" type="select" label="Use a built-in FASTA reference or retrieve one from your history?">
                <option value="builtin" selected="true">Built-in reference FASTA file</option>
                <option value="history">From your history</option>
            </param>
            <when value="builtin">
                <param name="builtin_reference" type="select" label="FASTA genome to use as a reference in order to assess DNA damage" help="Contact your Galaxy team in order to add a reference">
                    <options from_data_table="all_fasta"></options>
                    <validator type="no_options" message="No FASTA references are available, please contact your Galaxy administrators."/>
                </param>
            </when>
            <when value="history">
                <param name="history_reference" type="data" format="fasta" label="FASTA genome to use as a reference in order to assess DNA damage"/>
            </when>
        </conditional>

        <!--GENERAL OPTIONS-->

        <param argument="--merge-reference-sequences" type="boolean" truevalue="--merge-reference-sequences" falsevalue="" label="Merge reference sequences in result files?" help="Useful for reducing memory usage and output size when the reference contains many sequences."/>
        <param argument="--min-basequal" type="integer" min="0" max="93" value="0" label="Minimum PHRED score for base to be considered" help="(assumes PHRED+33 score)"/>
        <param argument="--fasta" type="boolean" checked="false" truevalue="--fasta" falsevalue="" label="Output alignments in FASTA format?"/>
        <section name="downsampling" title="Downsampling">
            <conditional name="downsampling">
                <param name="downsampling_type" type="select" label="Type of downsampling" help="Downsample using a percentage or a number of reads?">
                    <option value="no_downsampling" selected="true">No downsampling</option>
                    <option value="p">Percentage of reads</option>
                    <option value="n">Number of reads</option>
                </param>
                <when value="no_downsampling"/>
                <when value="p">
                    <param argument="--downsample" type="float" min="0" max="1" value="0.8" label="Percentage of reads to sample" help="Expressed as a fraction strictly between 0 and 1 (for example, 0.8 for 80% of reads)">
                        <validator type="in_range" min="0" exclude_min="true" max="1" exclude_max="true" message="Percentage of reads to downsample must be > 0 and &lt; 1"/>
                    </param>
                    <param argument="--downsample-seed" type="integer" optional="true" label="Downsampling seed" help="Seed (integer) used to randomly select reads for downsampling. Useful for reproducibility."/>
                </when>
                <when value="n">
                    <param argument="--downsample" type="integer" min="1" value="1000" label="Number of reads to sample" help="Must be greater than or equal to 1"/>
                    <param argument="--downsample-seed" type="integer" optional="true" label="Downsampling seed" help="Seed (integer) used to randomly select reads for downsampling. Useful for reproducibility."/>
                </when>
            </conditional>
        </section>
        <section name="window_size" title="Analysis Window Size">
            <param argument="--length" type="integer" min="1" value="70" label="Number of nucleotides to process, starting from 5p and 3p end of read" help="(Bases which are located further than this from one of the two read ends will not be analyzed)"/>
            <param argument="--around" type="integer" min="0" value="10" label="Number of nucleotides to retrieve before and after read" help="(This is used in order to look at purine enrichment before strand breaks)"/>
        </section>

        <!--GRAPHICS OPTIONS-->

        <section name="graphics" title="Graphics Options">
            <param argument="--ymax" type="float" min="0" max="1" value="0.3" label="Graphical y-axis limit for nucleotide misincorporation frequency" help="(Bottom plot in the Fragmisincorporation_plot.pdf file)">
                <validator type="in_range" min="0" exclude_min="true" max="1" exclude_max="false" message="--ymax must be > 0 and &lt;= 1"/>
            </param>
            <param argument="--readplot" type="integer" min="0" value="25" label="Number of bases to plot from 5p and 3p ends (x-axis)" help="(Bottom plot in the Fragmisincorporation_plot.pdf file) // Must be less than or equal to --length.">
                <validator type="in_range" min="0" exclude_min="true" message="--readplot must be > 0"/>
            </param>
            <param argument="--refplot" type="integer" min="0" value="10" label="Number of upstream and downstream nucleotides to plot for the composition plots" help="(Top four plots in the Fragmisincorporation_plot.pdf file) // Must be less than or equal to --around."/>
            <param argument="--title" type="text" optional="true" label="Plot title" help="Title to give to the plots in the pdf files"/>
            <param argument="--theme-bw" type="select" label="Graphical theme to use for the posterior prediction plot">
                <option value="" selected="true">gray (default)</option>
                <option value="--theme-bw">bw (--theme-bw)</option>
            </param>
        </section>

        <!--STATISTICS OPTIONS-->

        <section name="statistics" title="Statistical Estimation Options">
            <conditional name="stats">
                <param argument="--no-stats" type="select" label="Enable statistical estimations?">
                    <option value="" selected="true">Yes</option>
                    <option value="--no-stats">No</option>
                </param>
                <when value="">
                    <param argument="--rand" type="integer" min="0" value="30" label="Number of random starting points for the likelihood optimization"/>
                    <param argument="--burn" type="integer" min="0" value="10000" label="Number of burn-in iterations"/>
                    <param argument="--adjust" type="integer" min="0" value="10" label="Number of adjust proposal variance parameter iterations"/>
                    <param argument="--iter" type="integer" min="0" value="50000" label="Number of final MCMC iterations"/>
                    <conditional name="ends">
                        <param name="ends_to_process" type="select" label="Read ends to analyze for statistical estimation" help="(relates to --forward and --reverse options, default is to use both ends)">
                            <option value="" selected="true">5p and 3p ends</option>
                            <option value="--forward">5p ends only</option>
                            <option value="--reverse">3p ends only</option>
                        </param>
                        <when value="">
                            <param argument="--diff-hangs" type="boolean" checked="false" truevalue="--diff-hangs" falsevalue="" label="Overhangs are different in 5p and 3p?"/>
                        </when>
                        <when value="--forward"/>
                        <when value="--reverse"/>
                    </conditional>
                    <param argument="--seq-length" type="integer" min="1" value="12" label="Number of nucleotides (on both ends if both are used) to use for statistical estimations." help="Must be less than or equal to the --length option."/>
                    <param argument="--var-disp" type="boolean" checked="false" truevalue="--var-disp" falsevalue="" label="Variable dispersion in the overhangs?"/>
                    <param name="mutation_model" type="select" label="Substitution model to use in statistical estimation">
                        <option value="" selected="true">HKY85 model (Hasegawa, Kishino and Yano, default option)</option>
                        <option value="--jukes-cantor">Jukes-Cantor model (--jukes-cantor)</option>
                    </param>
                    <param name="nick_vector" type="select" label="Calculation of the nick vector" help="Default option is to estimate nick frequencies along the sequence using a GAM (Generalized Additive Model), followed by smoothing. GAM estimation will only be done if there are enough substitutions.">
                        <option value="" selected="true">Estimate using a GAM + smoothing</option>
                        <option value="--use-raw-nick-freq">Estimate using a GAM, no smoothing (--use-raw-nick-freq)</option>
                        <option value="--fix-nicks">Make it constant (--fix-nicks)</option>
                        <option value="--single-stranded">Single-stranded library protocol (makes the vector constant by filling with ones) (--single-stranded)</option>
                    </param>
                </when>
                <when value="--no-stats"/>
            </conditional>
        </section>

        <!--RESCALING OPTIONS-->

        <section name="rescale" title="Rescaling options">
            <conditional name="rescaling">
                <param argument="--rescale" type="select" label="Perform rescaling of base quality according to DNA damage level?" help="This is helpful for downstream analyses, as damaged bases create less noise.">
                    <option value="" selected="true">No</option>
                    <option value="--rescale">Yes</option>
                </param>
                <when value="--rescale">
                    <param argument="--rescale-length-5p" type="integer" min="0" optional="true" label="Number of bases to rescale at the 5p end" help="Defaults to --seq-length if left empty. // Must be less than or equal to --seq-length."/>
                    <param argument="--rescale-length-3p" type="integer" min="0" optional="true" label="Number of bases to rescale at the 3p end" help="Defaults to --seq-length if left empty. // Must be less than or equal to --seq-length."/>
                </when>
                <when value=""/>
            </conditional>
        </section>

    </inputs>

    <outputs>
        <!--RUNTIME LOG FILE-->
        <data name="runtime_log" format="txt" from_work_dir="mapDamage_results/Runtime_log.txt" label="${tool.name} on ${on_string}: Runtime_log.txt"/>

        <!--RESULT FILES THAT DESCRIBE DNA DAMAGE OBSERVED ON READS-->

        <collection name="damage_visualisation" type="list" label="${tool.name} on ${on_string}: Data description files and plots">
            <data name="dnacomp" format="txt" from_work_dir="mapDamage_results/dnacomp.txt" label="${tool.name} on ${on_string}: dnacomp.txt"/>
            <data name="misincorporation" format="txt" from_work_dir="mapDamage_results/misincorporation.txt" label="${tool.name} on ${on_string}: misincorporation.txt"/>
            <data name="5pCtoT_freq" format="txt" from_work_dir="mapDamage_results/5pCtoT_freq.txt" label="${tool.name} on ${on_string}: 5pCtoT_freq.txt"/>
            <data name="3pGtoA_freq" format="txt" from_work_dir="mapDamage_results/3pGtoA_freq.txt" label="${tool.name} on ${on_string}: 3pGtoA_freq.txt"/>
            <data name="Fragmisincorporation_plot" format="pdf" from_work_dir="mapDamage_results/Fragmisincorporation_plot.pdf" label="${tool.name} on ${on_string}: Fragmisincorporation_plot.pdf"/>
            <data name="lgdistribution" format="txt" from_work_dir="mapDamage_results/lgdistribution.txt" label="${tool.name} on ${on_string}: lgdistribution.txt"/>
            <data name="Length_plot" format="pdf" from_work_dir="mapDamage_results/Length_plot.pdf" label="${tool.name} on ${on_string}: Length_plot.pdf"/>
        </collection>

        <!--RESULT FILES THAT DESCRIBE STATISTICAL ESTIMATIONS OF DAMAGE PARAMETERS-->

        <collection name="statistical_estimation" type="list" label="${tool.name} on ${on_string}: Statistical estimation files and plots">
            <data name="dnacomp_genome" format="csv" from_work_dir="mapDamage_results/dnacomp_genome.csv" label="${tool.name} on ${on_string}: dnacomp_genome.csv"/>
            <data name="Stats_out_MCMC_iter_summ_stat" format="csv" from_work_dir="mapDamage_results/Stats_out_MCMC_iter_summ_stat.csv" label="${tool.name} on ${on_string}: Stats_out_MCMC_iter_summ_stat.csv"/>
            <data name="Stats_out_MCMC_hist" format="pdf" from_work_dir="mapDamage_results/Stats_out_MCMC_hist.pdf" label="${tool.name} on ${on_string}: Stats_out_MCMC_hist.pdf"/>
            <data name="Stats_out_MCMC_iter" format="csv" from_work_dir="mapDamage_results/Stats_out_MCMC_iter.csv" label="${tool.name} on ${on_string}: Stats_out_MCMC_iter.csv"/>
            <data name="Stats_out_MCMC_trace" format="pdf" from_work_dir="mapDamage_results/Stats_out_MCMC_trace.pdf" label="${tool.name} on ${on_string}: Stats_out_MCMC_trace.pdf"/>
            <data name="Stats_out_MCMC_correct_prob" format="csv" from_work_dir="mapDamage_results/Stats_out_MCMC_correct_prob.csv" label="${tool.name} on ${on_string}: Stats_out_MCMC_correct_prob.csv"/>
            <data name="Stats_out_MCMC_post_pred" format="pdf" from_work_dir="mapDamage_results/Stats_out_MCMC_post_pred.pdf" label="${tool.name} on ${on_string}: Stats_out_MCMC_post_pred.pdf"/>
            <filter>statistics['stats']['no_stats'] == ''</filter>
        </collection>

        <!--OPTIONAL ALIGNMENT OUTPUT FILES-->

        <data name="rescaled_bam" format="bam" from_work_dir="mapDamage_results/alignment.rescaled.bam" label="${tool.name} on ${on_string}: Rescaled .bam file">
            <filter>rescale['rescaling']['rescale'] == '--rescale'</filter>
        </data>
        <data name="fasta_alignment" format="fasta" from_work_dir="mapDamage_results/alignment.fasta" label="${tool.name} on ${on_string}: FASTA file of alignments">
            <filter>fasta</filter>
        </data>
    </outputs>

    <tests>

        <!--GENERAL TESTS-->
        <!--NB: only files from the damage_visualisation collection are tested, as the statistical estimation process is stochastic and its output files differ between runs-->
            <!--SAM input-->
        <test expect_num_outputs="17" expect_exit_code="0">
            <param name="sbam_file" value="test_align.sam" ftype="sam"/>
            <conditional name="reference">
                <param name="ref_source" value="history"/>
                <param name="history_reference" value="ref.fa" ftype="fasta"/>
            </conditional>
            <output_collection name="damage_visualisation" type="list">
                <element name="misincorporation" file="reference_misincorporation.txt" compare="diff" lines_diff="2"/>
                <element name="lgdistribution" file="reference_lgdistribution.txt" compare="diff" lines_diff="2"/>
            </output_collection>
        </test>

        <!--BAM input-->
        <test expect_num_outputs="17" expect_exit_code="0">
            <param name="sbam_file" value="test_align.bam" ftype="bam"/>
            <conditional name="reference">
                <param name="ref_source" value="history"/>
                <param name="history_reference" value="ref.fa" ftype="fasta"/>
            </conditional>
            <output_collection name="damage_visualisation" type="list">
                <element name="misincorporation" file="reference_misincorporation.txt" compare="diff" lines_diff="2"/>
                <element name="lgdistribution" file="reference_lgdistribution.txt" compare="diff" lines_diff="2"/>
            </output_collection>
        </test>

        <!--TEST TO VERIFY BUILT-IN REFERENCE GENOMES WORK AS INTENDED-->
        <test expect_num_outputs="9" expect_exit_code="0">
            <param name="sbam_file" value="test_align.bam" ftype="bam"/>
            <conditional name="reference">
                <param name="ref_source" value="builtin"/>
                <param name="builtin_reference" value="test"/>
            </conditional>
            <section name="statistics">
                <conditional name="stats">
                    <param name="no_stats" value="--no-stats"/>
                </conditional>
            </section>
            <output_collection name="damage_visualisation" type="list">
                <element name="misincorporation" file="reference_misincorporation.txt" compare="diff" lines_diff="2"/>
                <element name="lgdistribution" file="reference_lgdistribution.txt" compare="diff" lines_diff="2"/>
            </output_collection>
        </test>

        <!--TEST TO VERIFY no_stats OPTION WORKS AS INTENDED-->
        <test expect_num_outputs="9" expect_exit_code="0">
            <param name="sbam_file" value="test_align.sam" ftype="sam"/>
            <conditional name="reference">
                <param name="ref_source" value="history"/>
                <param name="history_reference" value="ref.fa" ftype="fasta"/>
            </conditional>
            <section name="statistics">
                <conditional name="stats">
                    <param name="no_stats" value="--no-stats"/>
                </conditional>
            </section>
            <output_collection name="damage_visualisation" type="list">
                <element name="misincorporation" file="reference_misincorporation.txt" compare="diff" lines_diff="2"/>
                <element name="lgdistribution" file="reference_lgdistribution.txt" compare="diff" lines_diff="2"/>
            </output_collection>
        </test>

        <!--TEST TO VERIFY fasta OPTION WORKS AS INTENDED-->
        <test expect_num_outputs="10" expect_exit_code="0">
            <param name="sbam_file" value="test_align.sam" ftype="sam"/>
            <conditional name="reference">
                <param name="ref_source" value="history"/>
                <param name="history_reference" value="ref.fa" ftype="fasta"/>
            </conditional>
            <section name="statistics">
                <conditional name="stats">
                    <param name="no_stats" value="--no-stats"/>
                </conditional>
            </section>
            <param name="fasta" value="true"/>
            <output_collection name="damage_visualisation" type="list">
                <element name="misincorporation" file="reference_misincorporation.txt" compare="diff" lines_diff="2"/>
                <element name="lgdistribution" file="reference_lgdistribution.txt" compare="diff" lines_diff="2"/>
            </output_collection>
            <output name="fasta_alignment" file="alignment.fa" compare="diff" count="1"/>
        </test>

        <!--TEST TO VERIFY rescale OPTION WORKS AS INTENDED-->
        <!--BAM file is not compared, as it depends on the stochastic statistical estimations, and may therefore differ (though perhaps not with this dataset)-->
        <test expect_num_outputs="18" expect_exit_code="0">
            <param name="sbam_file" value="test_align.sam" ftype="sam"/>
            <conditional name="reference">
                <param name="ref_source" value="history"/>
                <param name="history_reference" value="ref.fa" ftype="fasta"/>
            </conditional>
            <section name="rescale">
                <conditional name="rescaling">
                    <param name="rescale" value="--rescale"/>
                </conditional>
            </section>
            <output_collection name="damage_visualisation" type="list">
                <element name="misincorporation" file="reference_misincorporation.txt" compare="diff" lines_diff="2"/>
                <element name="lgdistribution" file="reference_lgdistribution.txt" compare="diff" lines_diff="2"/>
            </output_collection>
        </test>
    </tests>

    <help><![CDATA[
    **Overview**:

mapDamage is a computational framework which allows for authentication of ancient DNA NGS reads by tracking and quantifying damage patterns which are specific to ancient DNA molecules.

The following damage patterns can be studied thanks to mapDamage's plots and Bayesian statistical estimations:

	* Fragment length
	* Purine enrichment before strand breaks
	* Cytosine deaminations and resulting misincorporations (C → T and G → A) (in single-stranded and double-stranded context)
	* Nicking
	* Overhang lengths

This help is divided into subsections: Inputs, Outputs, Options, and General usage. For a quick read, jump down to the 'General usage' section!

-----

    **Inputs**:

mapDamage only needs two input files:

	* An alignment map file, either in BAM or SAM format.

	* A reference sequence, in FASTA format. Can either be uploaded to Galaxy or chosen from a list of built-in references (no uploading required).

-----

    **Outputs**:

The most commonly used outputs are listed in the 'General Usage' section below.

mapDamage has plenty of output files, which have been grouped into Data Collections in order to improve clarity.

* **Runtime_log.txt**: Contains general information about the command line which was executed by Galaxy, as well as the time taken by each task.

* Outputs in the **'Data description files and plots' collection**: These outputs (text files and associated PDF plots) describe the damage patterns observed in your input BAM / SAM. No statistical estimations were involved in their generation.

	- **dnacomp.txt**: Contains the counts of each base (ATGC) across all reads, for the first X bases from the end in question (X is given by the --length option) as well as the Y bases surrounding the end in question (Y is given by the --around option). This is broken down by end (3p/5p), by strand (+/-), and by chromosome/contig.

	- **misincorporation.txt**: Contains the counts of each type of mutation (substitutions, deletions, insertions, soft-clipping) across all reads, for the first X bases from the end in question (X is given by the --length option). This is broken down by end (3p/5p), by strand (+/-), and by chromosome/contig.

	- **3pGtoA_freq.txt**: Contains the frequencies of G → A substitutions across all reads, for the first N bases from the 3p end (N is given by --readplot).

	- **5pCtoT_freq.txt**: Contains the frequencies of C → T substitutions across all reads, for the first N bases from the 5p end (N is given by --readplot).

	- **Fragmisincorporation_plot.pdf**: Contains plots of the above data: one frequency plot per nucleotide, as well as a plot describing the distribution of C → T and G → A substitutions.

	- **lgdistribution.txt**: Contains the distribution of reads according to strand on which they map (+/-) and length.

	- **Length_plot.pdf**: Contains plots of the above data.

* Outputs in the **'Statistical estimation files and plots' collection**: These outputs describe the Bayesian statistical estimation process as well as its results. They are only produced when statistical estimation is enabled.

	- **dnacomp_genome.csv**: Contains whole-genome base frequencies, as calculated by seqtk.

	- **Stats_out_MCMC_iter_summ_stat.csv**: Contains mean value, standard deviation, acceptance ratio and posterior distribution for the following parameters:

		* θ  : difference rate between reference and sample that is not due to DNA damage.
		* ρ  : transversion / transition bias.
		* δd : rate of Cytosine deamination in a double-stranded context.
		* δs : rate of Cytosine deamination in a single-stranded context.
		* λ  : probability of terminating an overhang.
		* The model's log-likelihood.

	- **Stats_out_MCMC_hist.pdf**: Contains plots of the posterior distribution of the 6 parameters described above.

	- **Stats_out_MCMC_iter.csv**: Contains the mean values for each of the 6 parameters described above at every MCMC iteration.

	- **Stats_out_MCMC_trace.pdf**: Contains trace plots derived from the previous file.

	- **Stats_out_MCMC_correct_prob.csv**: Contains posterior probabilities that C → T and G → A substitutions are due to DNA damage, for the first N bases from both 3p and 5p ends (N is given by --seq-length).

	- **Stats_out_MCMC_post_pred.pdf**: Contains plots derived from the previous file, as well as predictive intervals for these probabilities.


* **alignment.fasta** (if --fasta option was set): FASTA file containing alignments as specified by the BAM / SAM file. FASTA file also contains the nucleotides surrounding the reads as defined by the --around parameter.

* **alignment.rescaled.bam** (if --rescale option was set): BAM file containing a rescaled version of the alignment: the --rescale parameter leads to a re-evaluation of base quality based on misincorporation probability, which can decrease the noise created by these misincorporations in downstream analyses.

-----

    **Options**:

mapDamage has many parameters, which have been grouped into sections for convenience.

Here, the objective is to provide a guide on the most important tool features, and only a few key parameters are therefore detailed.

More details concerning each option can be found:

	- on the mapDamage GitHub.
	- in the descriptions given below the options.

Some options may lack a detailed explanation (statistics related options in particular). In this case, please refer to the citations below and / or to the source code, available on the mapDamage GitHub.

	* **--no-stats**: Do not perform statistical estimation of damage parameters. This is useful if you only plan on looking at the *Fragmisincorporation_plot.pdf* plot to see whether the damage patterns are present, as it greatly shortens the run-time.
	* **--rescale**: Produces a new BAM alignment file where base qualities have been re-evaluated according to damage probability. Please note that this can only be done if statistical estimations are performed. Useful for reducing noise in downstream analyses: aDNA related misincorporations (C → T, G → A) are no longer falsely flagged as SNPs.
	* **--merge-reference-sequences**: Merges results for the different sequences (chromosomes or contigs) in your reference FASTA file. Instead of one table per reference sequence in the dnacomp.txt and misincorporation.txt files, there is a single table covering all sequences. This greatly reduces the size of these files (size is more or less divided by the number of reference sequences).

	* Options regarding **specific library preparation protocols**: By default, mapDamage assumes that you are working with merged paired-end reads, from a double-stranded library protocol. If that is not the case, the tool authors have made recommendations, which you can find in the supplementary data of Jónsson et al. (2013).

	    - For **single-stranded libraries**:

		    * Calculation of the nick vector: single-stranded library protocol (**--single-stranded**). The authors recommend this option for treating single-stranded library data, and remark that the user should see elevated C → T substitution frequencies at both ends in the Stats_out_MCMC_post_pred.pdf plot.

	    - For **single-end libraries**:

		    * Read ends to analyze for statistical estimation: 5p ends only (**--forward**). This forces mapDamage to only analyze the 5p ends of reads.
		    * Overhangs are different in 5p and 3p?: Yes (**--diff-hangs**). This option allows different mean lengths for the two overhangs, which can be another way to account for single-end data.

-----

    **General usage**:

	(This description is also based on the supplementary data of Jónsson et al. (2013))

    **Launching mapDamage**

	For a simple usage of mapDamage, the user only needs to provide the two input files (alignment file and reference file), and press the 'Run tool' button.

    If downstream SNP analyses are planned, one may want to use the rescaling feature, which reduces the noise induced by cytosine deamination in these analyses.

    Statistical estimation may be disabled to improve run-time, if the user doesn't intend to make use of it.

    **Analyzing the outputs**

	The first outputs one should have a look at are the *Fragmisincorporation_plot.pdf* and the *Length_plot.pdf* files. One can check if the expected damage is present, and if there are any problems with the dataset.

	If statistical estimations were run, the user should check whether the proposed model is a good fit: do the empirical substitution frequencies (*Fragmisincorporation_plot.pdf* plot) fall within the prediction intervals (*Stats_out_MCMC_post_pred.pdf* plot)?

    The fit can also be assessed by looking at the *Stats_out_MCMC_trace.pdf* plot, to check if an equilibrium was reached over the course of the iterations. Finally, one can have a look at the acceptance ratio values in the *Stats_out_MCMC_iter_summ_stat.csv* file (3rd row), which should be between 0.1 and 0.3.

    If the fit is not correct, it may be improved by increasing the values of the --rand, --burn, --adjust and --iter options.

	If the fit looks correct, the rescaled BAM file can be confidently used, and the values of the damage parameters described in the *Stats_out_MCMC_iter_summ_stat.csv* may be used for comparison across samples.
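
The acceptance ratio check described above can be sketched in Python. This is a minimal illustration under stated assumptions: the column names (Theta, Rho, DeltaD, DeltaS, Lambda, LogLik), the row label "Acceptance ratio" and the numbers in ``SAMPLE`` are made up for the example and may not match the layout of your *Stats_out_MCMC_iter_summ_stat.csv* file, so adjust them after inspecting the real file. The function names are hypothetical.

```python
import csv
import io

# Made-up example content; NOT real mapDamage output.
SAMPLE = "\n".join([
    ",Theta,Rho,DeltaD,DeltaS,Lambda,LogLik",
    "Mean,0.01,0.9,0.02,0.5,0.3,-100.0",
    "Std,0.001,0.05,0.002,0.05,0.02,1.5",
    "Acceptance ratio,0.25,0.22,0.05,0.28,0.35,NA",
])


def acceptance_ratios(csv_text, row_label="Acceptance ratio"):
    """Return {parameter: ratio} from the row whose first cell is row_label.

    Assumes the first CSV row is a header and the first column holds row
    labels; empty and NA cells are skipped.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0][1:]
    for row in rows[1:]:
        if row and row[0] == row_label:
            return {name: float(value)
                    for name, value in zip(header, row[1:])
                    if value not in ("", "NA")}
    raise ValueError(f"no row labelled {row_label!r}")


def out_of_range(ratios, low=0.1, high=0.3):
    """List parameters whose acceptance ratio falls outside [low, high]."""
    return sorted(name for name, r in ratios.items() if not low <= r <= high)


if __name__ == "__main__":
    print(out_of_range(acceptance_ratios(SAMPLE)))
```

Parameters reported by ``out_of_range`` are candidates for a rerun with larger --burn, --adjust or --iter values.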
    ]]></help>

    <citations>
        <citation type="doi">10.1093/bioinformatics/btt193</citation>
        <citation type="doi">10.1093/bioinformatics/btr347</citation>
    </citations>

</tool>
author	iuc
date	Tue, 22 Jul 2025 09:02:06 +0000
parents	761b6fdcac6e
children