diff hicCorrectMatrix.xml @ 10:bfa1c014f64a draft

planemo upload for repository https://github.com/maxplanck-ie/HiCExplorer/tree/master/galaxy/wrapper/ commit dddc0b9035b8edadfd45d74b01aeca245c2725d7
author iuc
date Fri, 27 Apr 2018 08:38:17 -0400
parents ac80bd0a96ca
children 92fc291ceb1a
--- a/hicCorrectMatrix.xml	Fri Apr 27 03:29:59 2018 -0400
+++ b/hicCorrectMatrix.xml	Fri Apr 27 08:38:17 2018 -0400
@@ -189,28 +189,24 @@
 Diagnostic plot
 _______________
 
-The diagnostic plot consists of a bar plot of the contacts coverage per bins size together with the
-modified z-score based on the Median Absolute Deviation (MAD) method.
 
-See Boris Iglewicz and David Hoaglin 1993, Volume 16:
-How to Detect and Handle Outliers The ASQC Basic References in Quality Control: Statistical Techniques,
-Edward F. Mykytka, Ph.D., Editor.
-
-Using this diagnostic plot, a user can decide if values
-with a too low (and/or too high) number of contacts in respect to their genomic distance should
-be removed from the data before the correction applies.
-
-Moreover, the shown distribution should be a Gaussian bell. If it doesn’t follow a Gaussian distribution
-this is an indicator that the used data is of bad quality or that the used contact matrix
-is maybe not the one that should be used. It can happen that users select for example a merge
-matrix with a lower resolution that was previously needed for plotting. In such cases the
-diagnostic plot helps to detect this and prevent the user from running the analysis on a wrong dataset.
+The goal of the diagnostic plot is to help the user decide on a cutoff threshold below which Hi-C matrix
+bins with few assigned reads are ignored. The plot is a histogram of the total number of Hi-C reads per matrix bin.
+A secondary scale, based on the median absolute deviation (MAD) score, is shown on top of the figure.
+This secondary scale aims to offer 'normalized' values that are comparable across samples
+independently of the sequencing depth and the fraction of usable Hi-C reads. In all samples that we have studied,
+the histogram follows a bimodal distribution in which the first peak corresponds to bins with zero reads, which usually occur
+at repetitive regions. Other low-scoring bins tend to be close to repetitive regions.
+Low-scoring bins can also arise when the bin contains no restriction site, or when a restriction
+site is present but the restriction enzyme did not cut. The valley between the two peaks in the
+histogram is set by default as the cutoff threshold.
+However, it is important to review this default, as in some cases the selected value may not be correct.
 
 
 .. image:: $PATH_TO_IMAGES/diagnostic_plot.png
     :width: 50%
 
-On the example plot above, a user can then use the lower threshold defined by the MAD method (black bold bar), or define its own threshold based on the contacts distribution.
+On the example plot above, a user can then use the lower threshold defined by the Median Absolute Deviation (MAD) method (black bold bar), or define their own threshold based on the distribution of contacts.
 
 Correct
 _______
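
Note on the diagnostic-plot help text in the hunk above: the MAD-based secondary scale can be illustrated with the modified z-score of Iglewicz and Hoaglin, which the removed text cites. The following is a minimal sketch, not the HiCExplorer implementation; the function name, the per-bin read totals, and the cutoff value are all hypothetical and chosen only for illustration.

```python
from statistics import median


def modified_z_scores(values):
    """Modified z-score based on the median absolute deviation (MAD).

    Illustrative only: this follows the Iglewicz and Hoaglin formula and
    is not taken from the HiCExplorer source code.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are normally distributed.
    return [0.6745 * (v - med) / mad for v in values]


# Hypothetical total read counts per matrix bin (made-up values).
bin_totals = [0, 0, 3, 95, 100, 102, 105, 110, 400]
scores = modified_z_scores(bin_totals)

# Hypothetical lower cutoff on the modified z-score scale.
low_cutoff = -2.0
kept = [total for total, score in zip(bin_totals, scores) if score >= low_cutoff]
print(kept)
```

Because the score is expressed relative to the median and the MAD of each sample, a cutoff on this scale is less sensitive to sequencing depth than a cutoff on raw read counts, which is the motivation given in the help text for the secondary scale.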