diff facets_analysis.xml @ 7:86bcdc94b008 draft

planemo upload for repository https://github.com/ARTbio/tools-artbio/tree/main/tools/facets commit 2da49e9385ddce5c74e077c81a52ff1ea4131b81
author artbio
date Wed, 08 Oct 2025 17:41:18 +0000
parents 625038b7d764
children
--- a/facets_analysis.xml	Mon Oct 06 15:50:12 2025 +0000
+++ b/facets_analysis.xml	Wed Oct 08 17:41:18 2025 +0000
@@ -23,6 +23,8 @@
                 --merge_gap_abs $merging.max_gap_abs
                 --merge_gap_rel $merging.max_gap_rel
             #end if
+            --vcf_min_nhet $filtering.vcf_min_nhet
+            --vcf_min_num_mark $filtering.vcf_min_num_mark
     ]]></command>
     <inputs>
         <param name="pileup" type="data" format="tabular.gz" label="FACETS Pileup File" help="Output from the 'SNP Pileup for FACETS' tool."/>
@@ -50,6 +52,10 @@
                 <param name="max_gap_rel" type="float" value="0.5" label="Relative maximum gap to merge (fraction)" help="Maximum relative distance, as a fraction of the average size of the two segments."/>
             </when>
         </conditional>
+        <section name="filtering" title="VCF Output Filtering" expanded="false">
+            <param name="vcf_min_nhet" type="integer" value="2" label="Minimum heterozygous SNPs for VCF output" help="Post-filter to remove final segments with fewer than this many heterozygous SNPs."/>
+            <param name="vcf_min_num_mark" type="integer" value="3" label="Minimum total markers for VCF output" help="Post-filter to remove final segments with fewer than this many total markers (SNPs). Helps remove SVLEN=0 artifacts."/>
+        </section>
     </inputs>
     <outputs>
         <data name="output_seg" format="tsv" label="FACETS Segmentation on ${on_string}"/>
@@ -69,14 +75,13 @@
             <output name="output_vcf" file="test_sample_01.cnv.vcf" ftype="vcf" lines_diff="2" />
         </test>
     </tests>
-<help><![CDATA[
+    <help><![CDATA[
             **What it does**
 
             This tool runs the `FACETS` R package to perform allele-specific copy number
             and clonal heterogeneity analysis. It takes the compressed pileup file
             generated by the "SNP Pileup for FACETS" tool as its primary input and
-            produces a set of standard FACETS outputs, including segmentation calls,
-            purity/ploidy estimates, plots, and a VCF file summarizing the CNV events.
+            produces a set of standard FACETS outputs.
 
             ---
 
@@ -85,45 +90,32 @@
             These parameters control the core of the FACETS segmentation algorithm.
 
             - **Critical value for segmentation (cval):** This is the most important
-              parameter for controlling the sensitivity of the segmentation. A *higher*
-              value (e.g., 200-800) will result in fewer segments and is generally
-              recommended for high-density data like Whole Genome Sequencing (WGS).
-              A *lower* value (e.g., 50-150) increases sensitivity, resulting in more
-              segments, and is more suitable for sparser data like Whole Exome
-              Sequencing (WES).
+              parameter for controlling segmentation sensitivity. A *higher* value
+              (e.g., 200-800) yields fewer segments and is recommended for
+              high-density data (WGS). A *lower* value (e.g., 50-150) yields more
+              segments (higher sensitivity) and is more suitable for sparser data (WES).
 
             - **Minimum number of heterozygous SNPs (min.nhet):** This is a quality
-              filter. After segmentation, any segment that is supported by fewer
-              heterozygous SNPs than this threshold will be discarded. This helps
-              to remove unreliable, small segments.
+              filter. Segments supported by fewer heterozygous SNPs than this
+              threshold will be discarded during the initial segmentation pass.
 
-            - **SNP neighbourhood size (snp.nbhd):** This parameter defines the genomic
-              window (in bp) around a SNP used for local read depth normalization.
-              The default value is generally appropriate.
+            - **SNP neighbourhood size (snp.nbhd):** Defines the genomic window (in bp)
+              around a SNP used for local read depth normalization.
 
             ---
 
-            **Advanced VCF Post-processing: Merging Segments**
+            **Advanced VCF Post-processing**
 
-            You can optionally enable a post-processing step to merge adjacent CNV
-            segments in the output VCF.
+            You can optionally enable post-processing steps to refine the final VCF.
 
-            *Why is this useful?*
-            Segmentation algorithms can sometimes split a single, large biological event
-            (e.g., a 10 Mb deletion) into several smaller, adjacent segments with the
-            same copy number state. This feature attempts to correct this by merging
-            these segments back together, providing a cleaner and more biologically
-            accurate representation of the CNV landscape.
+            - **Merging Segments:** This option merges adjacent CNV segments that likely
+              represent a single biological event, providing a cleaner and more
+              biologically accurate output.
 
-            The merging is controlled by an algorithm using two thresholds:
-
-            - **Absolute maximum gap:** The maximum distance in base pairs allowed
-              between two segments to even consider them for merging. This acts as a
-              safeguard.
-            - **Relative maximum gap:** The maximum distance allowed, expressed as a
-              *fraction* of the average size of the two segments. This allows large
-              gaps between large segments, but not between small ones, trying to mimic
-              how a human expert would interpret the data.
+            - **Filtering Segments:** This option removes low-quality or artifactual
+              segments based on the number of SNPs supporting them. This is recommended
+              because FACETS can sometimes report micro-segments that are not
+              biologically relevant.
 
             ---
 
@@ -132,18 +124,37 @@
             - **Segmentation file (TSV):** The raw segment data with genomic coordinates
               and their associated copy number (TCN, LCN).
             - **Summary file:** The main estimated parameters like purity, ploidy, etc.
+            - **Plots file (PNG):** A genome-wide visualization of the copy number and
+              allelic imbalance results across all chromosomes.
+            - **Spider Plot (PNG):** The most important **diagnostic plot** for assessing
+              the quality of the FACETS fit. See detailed explanation below.
             - **CNV calls file (VCF):** A summary of the detected copy number events in
-              a standard VCF format, suitable for downstream analysis.
-            - **Plots file (PNG):** An enhanced visualization of the genome-wide results.
-            - **Spider Plot (PNG):** This is the most important **diagnostic plot** for
-              assessing the quality of the FACETS fit.
-              On this plot (generated by the `logRlogORspider` function), each
-              **circle** is a genomic segment from your data. The **curves** (labeled
-              `2-1`, `1-0`, etc.) represent the theoretical positions for integer copy
-              number states given the estimated purity and ploidy. A high-confidence
-              result is achieved when your data (the circles) align closely with these
-              theoretical curves. For a detailed interpretation, please refer to the
-              original FACETS publication: Shen and Seshan, *NAR*, 2016.
-        ]]></help>
+              a standard VCF format for structural variants. The `ALT` column contains
+              symbolic alleles (`<DEL>`, `<DUP>`). All FACETS-specific details are in
+              the `INFO` field:
+
+              ``SVTYPE``
+                Type of variant (e.g., DEL, DUP).
+              ``EVENT``
+                FACETS classification (e.g., HOMOZYG_DEL, CN_LOH).
+              ``TCN``
+                Total Copy Number.
+              ``LCN``
+                Lesser Copy Number.
+              ``NUM_MARK``
+                Total number of SNPs in the segment.
+              ``NHET``
+                Number of heterozygous SNPs in the segment.
+
+            **Interpreting the Spider Plot**
+
+            On this plot (generated by the `logRlogORspider` function), each
+            **circle** is a genomic segment from your data. The **curves** (labeled
+            `2-1`, `1-0`, etc.) represent the theoretical positions for integer copy
+            number states. A high-confidence result is achieved when your data (the
+            circles) align closely with these curves. For details, refer to the
+            original FACETS publication: Shen and Seshan, *NAR*, 2016.
+
+    ]]></help>
     <expand macro="citations"/>
 </tool>
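
Review note: the two new `filtering` parameters describe a post-filter on the final segments written to the VCF. The changeset only shows the XML wrapper, not the script that consumes `--vcf_min_nhet` and `--vcf_min_num_mark`, so the following is a minimal Python sketch of the behavior the help text describes, not the tool's implementation. The `Segment` class and `filter_segments` function are hypothetical names, and it is an assumption that a segment must satisfy both thresholds to be kept. The defaults (2 and 3) are taken from the `value` attributes in the diff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for one final FACETS segment."""
    chrom: str
    start: int
    end: int
    num_mark: int  # total SNPs in the segment (NUM_MARK in the VCF INFO field)
    nhet: int      # heterozygous SNPs in the segment (NHET in the VCF INFO field)


def filter_segments(
    segments: List[Segment],
    vcf_min_nhet: int = 2,
    vcf_min_num_mark: int = 3,
) -> List[Segment]:
    """Keep segments that meet both thresholds.

    Assumption: a segment is removed if it has fewer than `vcf_min_nhet`
    heterozygous SNPs OR fewer than `vcf_min_num_mark` total markers.
    """
    return [
        s for s in segments
        if s.nhet >= vcf_min_nhet and s.num_mark >= vcf_min_num_mark
    ]


segments = [
    Segment("1", 1000, 500000, num_mark=120, nhet=35),  # well supported
    Segment("1", 500001, 500001, num_mark=1, nhet=0),   # micro-segment
    Segment("2", 2000, 90000, num_mark=10, nhet=1),     # too few het SNPs
]
kept = filter_segments(segments)
```

With the default thresholds, only the first segment survives; the single-marker segment is the kind of zero-length call the `vcf_min_num_mark` help text refers to.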