
<tool id="hicexplorer_hiccorrectmatrix" name="@BINARY@" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@">
    <description>run a Hi-C matrix correction algorithm</description>
    <macros>
        <token name="@BINARY@">hicCorrectMatrix</token>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements" />
    <command detect_errors="exit_code"><![CDATA[
        cp '$matrix_h5_cooler' 'matrix.$matrix_h5_cooler.ext' &&

        @BINARY@
            $mode.mode_selector
            --matrix 'matrix.$matrix_h5_cooler.ext'

            ## special: --chromosomes is optional, but if given needs at least one argument
            #set $chroms = ' '.join([ '\'' + str($var.chromosome) + '\'' for $var in $chromosomes ])
            #if $chroms
                --chromosomes $chroms
            #end if

            #if $mode.mode_selector == 'correct':
                #if $mode.correctionMethod.correctionMethod_selector == 'ice':
                    --correctionMethod ICE
                    --iterNum $mode.correctionMethod.iterNum


                    #if $mode.correctionMethod.filterThreshold_low and $mode.correctionMethod.filterThreshold_large:
                        --filterThreshold $mode.correctionMethod.filterThreshold_low $mode.correctionMethod.filterThreshold_large
                    #end if

                    #if $mode.correctionMethod.inflationCutoff:
                        --inflationCutoff $mode.correctionMethod.inflationCutoff
                    #end if

                    #if $mode.correctionMethod.transCutoff:
                        --transCutoff $mode.correctionMethod.transCutoff
                    #end if

                    #if $mode.correctionMethod.sequencedCountCutoff:
                        --sequencedCountCutoff $mode.correctionMethod.sequencedCountCutoff
                    #end if

                    $mode.correctionMethod.skipDiagonal
                    $mode.correctionMethod.perchr
                #else:
                    --correctionMethod KR
                #end if
                --outFileName 'matrix.$matrix_h5_cooler.ext'
            #elif $mode.mode_selector == 'diagnostic_plot':
                --plotName diagnostic_plot.png
                ##--outMatrixFile corrected_matrix.npz.h5
                #if $mode.xMax:
                    --xMax $mode.xMax
                #end if
                $mode.perchr
            #end if
        #if $mode.mode_selector == 'correct':
            && mv 'matrix.$matrix_h5_cooler.ext' '$outFileName'
        #end if
]]>
    </command>
    <inputs>
        <expand macro="matrix_h5_cooler_macro" />
        <conditional name="mode">
            <param name="mode_selector" type="select" label="Mode">
                <option value="diagnostic_plot">Diagnostic plot</option>
                <option value="correct">Correct matrix</option>
            </param>
            <when value="diagnostic_plot">
                <expand macro="xMax" />
                <param argument="--perchr" type="boolean" truevalue="--perchr" falsevalue="" checked="false" label="Compute statistics for each chromosome separately" />
            </when>
            <when value="correct">
                <conditional name="correctionMethod">
                    <param name="correctionMethod_selector" type="select" label="Correction method">
                        <option value="ice">Iterative correction (Imakaev)</option>
                        <option value="kr">Knight-Ruiz</option>
                    </param>

                    <when value="ice">
                        <param argument="--iterNum" type="integer" optional="true" value="500" label="Number of iterations" />

                        <param argument="--inflationCutoff" type="float" optional="true" label="Inflation cutoff" value="" help="Value corresponding to the maximum number of times a bin can be scaled up during the iterative correction.
                            For example, an inflationCutoff of 3 will filter out all bins that were expanded 3 times or more during the iterative correction." />

                        <param argument="--transCutoff" type="float" optional="true" label="Trans region cutoff" value="" help="Clip high counts in the top fraction of trans regions (i.e. between chromosomes) given by this value. A usual value is 0.05." />

                        <param argument="--sequencedCountCutoff" optional="true" type="float" label="Sequenced count cutoff" help="Each bin receives a value indicating the fraction that is covered by reads.
                                A cutoff of 0.5 will discard all those bins that have less than half of the bin covered." />

                        <param argument="--skipDiagonal" type="boolean" truevalue="--skipDiagonal" falsevalue="" checked="false" label="Skip diagonal counts" />

                        <param argument="--perchr" type="boolean" truevalue="--perchr" falsevalue="" checked="false" label="Normalize each chromosome separately" />
                        <expand macro="filterThreshold" />
                    </when>
                    <when value="kr">
                    </when>
                </conditional>
            </when>
        </conditional>

        <repeat name="chromosomes" min="0" title="Include chromosomes" help="List of chromosomes to be included in the iterative correction.
                    The order of the given chromosomes will be kept for the resulting corrected matrix">
            <param name="chromosome" type="text" value="" label="chromosome (one per field)">
                <validator type="empty_field" />
            </param>
        </repeat>

    </inputs>
    <outputs>
        <data name="outFileName" format="cool">
            <change_format>
                <when input_dataset="matrix_h5_cooler" attribute="ext" value="h5" format="h5" />
            </change_format>
            <filter>mode['mode_selector'] == "correct"</filter>
        </data>
        <data name="diagnostic_plot" from_work_dir="diagnostic_plot.png" format="png" label="${tool.name} on ${on_string}: Diagnostic plot">
            <filter>mode['mode_selector'] == "diagnostic_plot"</filter>
        </data>
    </outputs>
    <tests>
        <test expect_num_outputs="1">
            <param name="matrix_h5_cooler" value="small_test_matrix.h5" />
            <param name="mode_selector" value="correct" />
            <param name="correctionMethod_selector" value="ice" />
            <repeat name="chromosomes">
                <param name="chromosome" value="chrUextra" />
            </repeat>
            <repeat name="chromosomes">
                <param name="chromosome" value="chr3LHet" />
            </repeat>
            <param name="filterThreshold_low" value="-2.0" />
            <param name="filterThreshold_large" value="4" />
            <output name="outFileName" ftype="h5">
                <assert_contents>
                    <has_h5_keys keys="correction_factors,intervals,matrix,nan_bins" />
                </assert_contents>
            </output>
        </test>
        <test expect_num_outputs="1">
            <param name="matrix_h5_cooler" value="small_test_matrix.h5" />
            <param name="mode_selector" value="correct" />
            <param name="correctionMethod_selector" value="kr" />
            <output name="outFileName" ftype="h5">
                <assert_contents>
                    <has_h5_keys keys="correction_factors,intervals,matrix" />
                </assert_contents>
            </output>
        </test>
        <test expect_num_outputs="1">
            <param name="matrix_h5_cooler" value="small_test_matrix.cool" />
            <param name="mode_selector" value="correct" />
            <param name="correctionMethod_selector" value="kr" />
            <output name="outFileName" ftype="cool">
                <assert_contents>
                    <has_h5_keys keys="bins,chroms,indexes,pixels" />
                </assert_contents>
            </output>
        </test>
        <test expect_num_outputs="1">
            <param name="matrix_h5_cooler" value="small_test_matrix.h5" />
            <param name="mode_selector" value="diagnostic_plot" />
            <repeat name="chromosomes">
                <param name="chromosome" value="chrUextra" />
            </repeat>
            <repeat name="chromosomes">
                <param name="chromosome" value="chr3LHet" />
            </repeat>
            <output name="diagnostic_plot" file="hicCorrectMatrix/diagnostic_plot.png" ftype="png" compare="sim_size" />
        </test>
    </tests>
    <help><![CDATA[

Hi-C contact matrix correction
==============================

**hicCorrectMatrix** runs either the iterative correction described in `Imakaev et al. (2012)`_ or the Knight-Ruiz correction on a Hi-C matrix. For the matrix correction to work well,
it is important to remove unassembled scaffolds (e.g. ``NT_``), mitochondrial DNA and the Y chromosome, and to keep only full-length
chromosomes, because scaffolds cause problems during matrix correction. The chromosomes to keep are therefore selected by name (e.g. 1-19, X, Y) in the *Include chromosomes* field.

**Important**: Use ‘chr1 chr2 chr3 etc.’ if your genome index uses chromosome names with the ‘chr’ prefix.

For the method to work correctly, bins with zero reads assigned to them should be removed, as they cannot be corrected. Bins with a low number of reads should be removed as well; otherwise the counts associated with those bins will be amplified during the correction step (zero and low coverage bins usually tend to contain repetitive regions). Bins with an extremely high number of reads can also be removed from the correction, as they may represent copy number variations.

To aid in the identification of bins with low and high read coverage, the ``diagnostic plot`` function of **hicCorrectMatrix** must be used.

Indeed, **hicCorrectMatrix** works in two steps:

  - **Diagnostic plot**: First, a histogram of the sum of contacts per bin (row sum) is produced. This plot needs to be inspected to decide on the best threshold for removing bins with a low number of reads.

  - **Correct**: The second step removes the bins outside of the defined thresholds and performs the iterative correction.

_________________

Usage
-----

This tool must be used on uncorrected matrices at restriction enzyme resolution or with merged bins (``hicMergeMatrixBins``).

_________________

Output
------

Diagnostic plot
_______________


The goal of the diagnostic plot is to help the user decide on a cutoff threshold that will ignore Hi-C matrix
bins with few reads assigned to them. The plot is a histogram of the total number of Hi-C reads per matrix bin.
A secondary scale, based on the median absolute deviation (MAD) score, is shown on top of the figure.
This secondary scale aims to offer 'normalized' values that are comparable across samples
independently of the sequencing depth and the fraction of usable Hi-C reads. In all samples that we have studied,
the histogram follows a bimodal distribution where the first peak is for bins with zero reads which usually occur
at repetitive regions. Other low scoring bins tend to be close to repetitive regions.
Also, low scoring bins can be caused by absence of a restriction site in the bin or because the restriction
site is present but the restriction enzyme did not cut. The valley between the two peaks in the
histogram is set by default as cutoff threshold.
However, it is important to review this value, as in some cases the automatically selected threshold may not be correct.


.. image:: $PATH_TO_IMAGES/diagnostic_plot.png
    :width: 50%

In the example plot above, a user can either use the lower threshold defined by the Median Absolute Deviation (MAD) method (black bold bar) or define their own threshold based on the contact distribution.
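
To make the MAD-based secondary scale more concrete, here is a minimal Python sketch of how a MAD-based (modified z-score) value could be computed from per-bin contact sums and used to select bins. It is an illustration only and is not taken from the **hicCorrectMatrix** source: the function names ``modified_z_scores`` and ``bins_to_keep`` are hypothetical, the constant 0.6745 is the usual scaling factor for modified z-scores, and the thresholds -2.0 and 4.0 are example values that should be chosen from your own diagnostic plot.

```python
from statistics import median


def modified_z_scores(row_sums):
    """Return MAD-based modified z-scores for per-bin contact sums.

    Hypothetical helper for illustration; the tool's internal
    computation may differ in its details.
    """
    med = median(row_sums)
    mad = median(abs(x - med) for x in row_sums)
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    # 0.6745 rescales the MAD so that it is comparable to a standard
    # deviation for normally distributed data.
    return [0.6745 * (x - med) / mad for x in row_sums]


def bins_to_keep(row_sums, low, high):
    """Return the indices of bins whose score lies within [low, high]."""
    scores = modified_z_scores(row_sums)
    return [i for i, s in enumerate(scores) if low <= s <= high]


# Toy data: two nearly empty bins, six typical bins, one very high bin.
row_sums = [0, 2, 95, 100, 105, 110, 98, 102, 900]
print(bins_to_keep(row_sums, low=-2.0, high=4.0))  # prints [2, 3, 4, 5, 6, 7]
```

In this toy example the two nearly empty bins fall below the lower threshold and the bin with 900 contacts lies above the upper one, so only the six typical bins are kept.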

Correct
_______

Runs the selected correction and outputs the corrected matrix. This matrix can then be used with all downstream analysis tools, such as ``hicPlotMatrix``, ``pyGenomeTracks``, ``hicPlotViewpoint`` and ``hicAggregateContacts`` for **visualization of Hi-C data**, and ``hicCorrelate``, ``hicPlotDistVsCounts``, ``hicTransform``, ``hicFindTADs`` and ``hicPCA`` **for data and scores computation on Hi-C data**.
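
The idea behind the iterative correction can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used by **hicCorrectMatrix**: it works on a small dense symmetric matrix, does no bin filtering, and the function name ``iterative_correction`` and its ``tolerance`` parameter are hypothetical. In each round, every bin is divided by its row sum relative to the mean row sum, until all non-empty rows have (nearly) the same sum.

```python
def iterative_correction(matrix, iterations=500, tolerance=1e-9):
    """Balance a symmetric contact matrix (illustrative sketch).

    Returns (corrected, bias) such that
    corrected[i][j] == matrix[i][j] / (bias[i] * bias[j]).
    """
    n = len(matrix)
    corrected = [[float(v) for v in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(iterations):
        sums = [sum(row) for row in corrected]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break  # nothing to correct in an all-zero matrix
        mean = sum(nonzero) / len(nonzero)
        # Per-bin scaling factor; empty bins are left untouched.
        factors = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                corrected[i][j] /= factors[i] * factors[j]
        bias = [b * f for b, f in zip(bias, factors)]
        # Stop once no bin needs to be rescaled any further.
        if max(abs(f - 1.0) for f in factors) < tolerance:
            break
    return corrected, bias


# Toy symmetric matrix with an empty diagonal.
raw = [[0, 10, 2], [10, 0, 5], [2, 5, 0]]
corrected, bias = iterative_correction(raw)
print([round(sum(row), 6) for row in corrected])
```

After the correction, the three printed row sums should be equal up to rounding, which is the property the correction aims for. The default of 500 iterations mirrors the default of ``--iterNum`` in this tool; a real matrix would additionally need the low and high coverage bins removed first, as described above.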

Note that ``hicSumMatrices`` and ``hicMergeMatrixBins`` **must be run on uncorrected matrices**.

_________________

| For more information about HiCExplorer, please see our documentation on readthedocs.io_

.. _readthedocs.io: http://hicexplorer.readthedocs.io/en/latest/index.html
.. _`Imakaev et al. (2012)`: https://doi.org/10.1038/nmeth.2148
]]>    </help>
    <expand macro="citations" />
</tool>
author	iuc
date	Mon, 04 Nov 2024 23:43:41 +0000
parents	62b720dceb56