Mercurial > repos > iuc > raceid_clustering

<tool id="raceid_clustering" name="Clustering using RaceID" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@" >
    <description>performs clustering, outlier detection, dimensional reduction</description>
    <macros>
        <import>macros.xml</import>
        <import>macros_cheetah.xml</import>
    </macros>
    <expand macro="bio_tools"/>
    <expand macro="requirements" />
    <expand macro="version_command_config" prog="cluster.R" cheetah="CLUSTER_CHEETAH" out="&#38;&#62; '$outlog'" />

    <inputs>
        <param name="inputrds" type="data" format="rds" label="Input RaceID RDS" help="Requires the RDS output from the normalisation stage" />
        <section name="clust" title="Clustering" expanded="true" >
            <!-- CompDist -->
            <param name="metric" type="select" label="Distance Metric" >
                <option value="pearson" selected="true" >Pearson</option>
                <option value="spearman">Spearman</option>
                <option value="logpearson">Log Pearson</option>
                <option value="euclidean">Euclidean</option>
            </param>
            <!-- ClustExp -->
            <param name="funcluster" type="select" label="Clustering method" >
                <option value="kmedoids" selected="true" >K-medoids</option>
                <option value="kmeans">K-means</option>
                <option value="hclust">H-Clust</option>
            </param>
            <expand macro="use_defaults_no" >
                <!-- CompDist -->
                <param name="fselect" type="boolean" value="true" label="Perform feature selection" />
                <param name="knn" type="integer" min="0" optional="true" label="KNN" help="Number of nearest neighbours for imputing gene expression" /><!-- 0: NULL -->
                <!-- ClustExp -->
                <param name="sat" type="boolean" checked="true" label="Saturation-based clustering?" help="Determine number of clusters on saturation point of the mean within-cluster dispersion as a function of the cluster number." />
                <param name="clustnr" type="integer" min="0" value="30" label="Max number of clusters using Saturation-by-mean" help="Max number of clusters for the derivation of the cluster number by the saturation of mean within-cluster-dispersion." />
                <param name="samp" type="integer" min="0" optional="true" label="Sample random number of cells" help="Number of random sample of cells used for the inference of cluster number and Jaccard similarity" /><!-- 0:NULL -->
                <param name="cln" type="integer" min="0" optional="true" label="Number of clusters" /><!-- 0:Null -->
                <param name="bootnr" type="integer" min="0" value="50" label="Number of booststrapping runs" />
            </expand>
        </section>
        <section name="outlier" title="Outliers" expanded="true" >
            <!-- Find Outliers -->
            <param name="outminc" type="integer" min="0" value="5" label="Minimum Transcripts" help="minimal transcript count of a gene in a clusters to be tested for being an outlier gene" />
            <param name="outlg" type="integer" min="1" value="2" label="Minimum Genes" help="Minimum number of outlier genes required for being an outlier cell" />
            <!-- RFCorrect -->
            <param name="final" type="boolean" value="true" label="Plot Final Clusters?" help="Reclassification of cell types using out-of-bag analysis is performed based on the final clusters after outlier identification. If 'FALSE', then the cluster partition prior to outlier identification is used for reclassification." />
            <expand macro="use_defaults_no" >
                <!-- Find Outliers -->
                <param name="probthr" type="float" min="0" value="0.001" label="Outlier Probability Threshold" help="Probability threshold for the above specified minimum number of genes to be an outlier cell. This probability is computed from a negative binomial background model of expression in a cluster" />
                <param name="outdistquant" type="float" min="0" max="1" value="0.95" label="Outlier Distance Quantile" help="Outlier cells are merged to outlier clusters if their distance smaller than the outdistquant-quantile of the distance distribution of pairs of cells in the orginal clusters after outlier removal" />
                <!-- RFCorrect -->
                <param name="nbtree" type="integer" optional="true" label="Number of trees to be built" /><!-- 0:Null -->
                <param name="nbfactor" type="integer" min="0" value="5" label="Tree Factor" help="Number of trees based on the number of cells multiplied by this factor. Effective only if the number of trees parameter is set to 0" />
                <!-- <param name="rfseed" type="integer" value="12345" label="Random Seed" /> -->
            </expand>
        </section>
        <section name="tsne" title="tSNE, FR and UMAP" expanded="true" >
            <!-- CompTSNE -->
            <param name="perplexity" type="integer" min="0" value="30" label="Perplexity" help="Perplexity of the t-SNE map" />
            <!-- CompFR -->
            <param name="knn" type="integer" min="1" value="10" label="KNN" help="Number of nearest neighbours used for the inference of the Fruchterman-Rheingold layout" />
            <!-- CompUMAP -->
            <param name="umap_nn" type="integer" min="1" value="15" label="N Neighbours" help="Number of nearest neighbours used for the UMAP connectivity" />
            <expand macro="use_defaults_no" >
                <!-- CompTSNE -->
                <param name="initial_cmd" type="boolean" checked="true" label="tSNE map initialised by classical multidimensional scaling" />
                <param name="rseed_tsne" type="integer" value="15555" label="Random Seed (tSNE)" />
                <!-- CompFR -->
                <param name="rseed_fr" type="integer" min="0" value="15555" label="Random Seed (FR)" />
                <!-- CompUMAP -->
                <param name="umap_epochs" type="integer" min="1" value="200" label="Number Epochs (UMAP)" />
                <param name="umap_min_dist" type="float" min="0" value="0.1" label="Min Dist (UMAP)" />
            </expand>
        </section>
        <section name="extra" title="Extra Parameters" expanded="false" >
            <param name="tablelim" type="integer" min="1" value="25" label="Table Limit" help="Top N genes to print per cluster" />
            <param name="plotlim" type="integer" min="1" value="10" label="Plot Limit" help="Top N genes to plot. Must be less than or equal to the Table Limit" />
            <param name="foldchange" type="float" min="0" value="1" label="Fold change" />
            <param name="pvalue" type="float" min="0" max="1" value="0.01" label="P-value Cutoff" help="P-value cutoff for the inference of differential gene expression" />
        </section>
    </inputs>
    <outputs>
        <data name="outpdf" format="pdf" label="${tool.name} on ${on_string}: PDF Report" />
        <data name="outrdat" format="rds" label="${tool.name} on ${on_string}: RDS" />
        <data name="outassignments" format="tabular" label="${tool.name} on ${on_string}: Cell-to-Cluster assignments" />
        <data name="outgenelist" format="tabular" label="${tool.name} on ${on_string}: Cluster - Genes per Cluster" />
        <data name="outlog" format="txt" label="${tool.name} on ${on_string}: Log" >
            <filter>use_log</filter>
        </data>
    </outputs>
    <tests>
        <test>
            <param name="inputrds" value="matrix.filter.rdat" />
            <output name="outgenelist" >
                <assert_contents>
                    <has_line_matching expression=".*ENSDARG00000108379\s+1.*" />
                    <has_line_matching expression=".*ENSDARG00000027310\s+8.*" />
                </assert_contents>
            </output>
            <output name="outrdat" value="matrix2.rdat" compare="sim_size" delta="40000" />
            <output name="outpdf" value="matrix2.pdf" compare="sim_size" delta="100" />
            <output name="outassignments" >
                <assert_contents>
                    <has_line_matching expression="P1_B1_ATCCAC\s1\s1\sFALSE" />
                </assert_contents>
            </output>
        </test>
        <test>
            <!-- defaults, but manually specified. No opts, no CC. Generates identical to above -->
            <param name="inputrds" value="matrix.filter.rdat" />
            <section name="clust" >
                <param name="metric" value="pearson" />
                <param name="funcluster" value="kmedoids" />
                <expand macro="test_nondef" >
                    <param name="fselect" value="true" />
                    <param name="sat" value="true" />
                    <param name="clustnr" value="30" />
                    <param name="bootnr" value="50" />
                    <param name="rseed" value="17000" />
                </expand>
            </section>
            <section name="outlier" >
                <param name="outminc" value="5" />
                <param name="outlg" value="2" />
                <param name="final" value="false" />
                <expand macro="test_nondef" section_name="outlier" >
                    <param name="probthr" value="0.001" />
                    <param name="outdistquant" value="0.95" />
                    <param name="rfseed" value="12345" />
                    <param name="nbfactor" value="5" />
                </expand>
            </section>
            <section name="tsne" >
                <param name="perplexity" value="30" />
                <param name="knn" value="10" />
                <expand macro="test_nondef" section_name="tsne" >
                    <param name="initial_cmd" value="true" />
                    <param name="rseed_tsne" value="15555" />
                    <param name="rfseed_fr" value="15555" />
                </expand>
            </section>
            <output name="outgenelist" >
                <assert_contents>
                    <has_line_matching expression=".*ENSDARG00000108379\s+1.*" />
                    <has_line_matching expression=".*ENSDARG00000027310\s+8.*" />
                </assert_contents>
            </output>
            <output name="outpdf" value="intestinal.pdf" compare="sim_size" delta="50" />
        </test>
        <test>
            <!-- Advanced. Opts, CC used  -->
            <param name="inputrds" value="matrix.filter.rdat" />
            <section name="clust" >
                <param name="metric" value="euclidean" />
                <param name="funcluster" value="hclust" />
                <expand macro="test_nondef" >
                    <param name="fselect" value="false" />
                    <param name="knn" value="5" />
                    <param name="sat" value="false" />
                    <param name="samp" value="10" />
                    <param name="cln" value="10" />
                    <param name="clustnr" value="10" />
                    <param name="bootnr" value="30" />
                    <param name="rseed" value="17000" />
                </expand>
            </section>
            <section name="outlier" >
                <param name="outminc" value="3" />
                <param name="outlg" value="5" />
                <param name="final" value="true" />
                <expand macro="test_nondef" >
                    <param name="probthr" value="0.01" />
                    <param name="outdistquant" value="0.5" />
                    <param name="rfseed" value="12345" />
                    <param name="nbfactor" value="5" />
                    <param name="nbtree" value="10" />
                </expand>
            </section>
            <section name="tsne" >
                <param name="perplexity" value="20" />
                <param name="knn" value="6" />
                <expand macro="test_nondef" >
                    <param name="initial_cmd" value="false" />
                    <param name="rseed_tsne" value="15555" />
                    <param name="rfseed_fr" value="15555" />
                </expand>
            </section>
            <output name="outgenelist" >
                <assert_contents>
                    <has_line_matching expression=".*ENSDARG00000107138\s+1.*" />
                    <has_line_matching expression=".*ENSDARG00000016138\s+10.*" />
                </assert_contents>
            </output>
            <output name="outpdf" value="intestinal_advanced.pdf" compare="sim_size" delta="150" />
        </test>
    </tests>
    <help><![CDATA[
RaceID3
=========

RaceID is a clustering algorithm for the identification of cell types from single-cell RNA-sequencing data. It was specifically designed for the detection of rare cells which correspond to outliers in conventional clustering methods.

This module performs clustering, and outlier detection and ultimately tells you how well defined your clusters are (even if the resultant tSNE plots look messy).

The tool generates the following:

 * A list of the most differentially expressed genes in each cluster
 * Cluster stability plots:
     * The mean within-cluster dispersion as a function of the cluster number and highlights the saturation point inferred based on the saturation criterion applied by RaceID3. The number of clusters where the change in within-cluster dispersion upon adding further clusters approaches linear behaviour demarcates the saturation point is highlighted in blue.
         * The point where this flattens out gives you a rough estimate of how many clusters there are in your analysis.
     * A scatter plot showing the gene expression variance as a function of the mean and the inferred polynomial fit of the background model, as well as a local regression.
        * This tells you which genes are the most significant against a background model of random expression.
     * A Jaccard stability plot which tells you how well defined your clusters are prior to outlier identification.
        * Good stable clusters should near the 0.8 mark, but 0.6 is acceptable, and for a large number of clusters, one or two low scoring clusters are also acceptable.
 * Heatmaps:
     * The initial and final clustering (as determined using random forest)
     * For each of the most differentially expressed genes in each cluster

The tool requires the RDS input from the previous filtering / normalisation / confounder removal step to work.


]]>
    </help>
    <expand macro="citations" />
</tool>
author	iuc
date	Tue, 05 Nov 2024 16:33:40 +0000
parents	0bff0ee0683a
children