view mahalanobis_distance.xml @ 1:2e7d47c0b027 draft

"planemo upload for repository https://malex@toolshed.g2.bx.psu.edu/repos/malex/secimtools"
author malex
date Mon, 08 Mar 2021 22:04:06 +0000
parents
children
line wrap: on
line source

<tool id="secimtools_mahalanobis_distance" name="Penalized Mahalanobis Distance (PMD)" version="@WRAPPER_VERSION@">
    <description>to compare groups</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements" />
    <stdio>
        <exit_code range="1:" level="warning" description="RuntimeWarning"/>
    </stdio>
    <command detect_errors="exit_code"><![CDATA[
mahalanobis_distance.py
--input $input
--design $design
--ID $uniqID
--figure $plot
--distanceToMean $out1
--distancePairwise $out2

#if $group
    --group $group
#end if

#if $levels
    --levels $levels
#end if

#if $p
    --per $p
#end if

#if $order
    --order $order
#end if

#if $penalty
    --penalty $penalty
#end if
    ]]></command>
    <inputs>
        <param name="input" type="data" format="tabular" label="Wide Dataset" help="Input your tab-separated wide format dataset. If file not tab separated see TIP below."/>
        <param name="design" type="data" format="tabular" label="Design File" help="Input your design file (tab-separated). Note you need a 'sampleID' column. If not tab separated see TIP below."/>
        <param name="uniqID" type="text" size="30" value="" label="Unique Feature ID" help="Name of the column in your wide dataset that has unique identifiers.."/>
        <param name="group" type="text" size="30" label="Group/Treatment [Optional]" help="Name of the column in your design file that contains group classifications." />
        <param name="order" type="text" size="30" label="Input Run Order Name [Optional]" help="Enter the name of the column containing the order samples were run. Spelling and capitalization must be exact." />
        <param name="levels" type="text" size="30" label="Additional groups to separate by [Optional]" help="Enter additional group(s) name(s) to include. Spelling and capitalization must be exact. If more than one group separate with ','." />
        <param name="p" type="float" value= ".95" size="6" label="Threshold" help="Threshold for standard distribution, specified as a percentile. Default = 0.95." />
        <param name="penalty" type="float" value= "0.5" size="6" label="λ Penalty" help="λ Penalty to use in the distance. The default is λ=0.5." />
    </inputs>
    <outputs>
        <data format="pdf" name="plot" label="${tool.name} on ${on_string}: plot" />
        <data format="tabular" name="out1" label="${tool.name} on ${on_string}: toMean" />
        <data format="tabular" name="out2" label="${tool.name} on ${on_string}: pairwise" />
    </outputs>
    <tests>
        <test>
            <param name="input"   value="ST000006_data.tsv"/>
            <param name="design"  value="ST000006_design.tsv"/>
            <param name="uniqID"  value="Retention_Index" />
            <param name="group"   value="White_wine_type_and_source" />
            <param name="penalty" value="0.5" />
            <output name="plot"   file="ST000006_mahalanobis_distance_figure.pdf" compare="sim_size" delta="10000" />
            <output name="out1"   file="ST000006_mahalanobis_distance_to_mean.tsv" />
            <output name="out2"   file="ST000006_mahalanobis_distance_pairwise.tsv" />
        </test>
    </tests>
    <help><![CDATA[

@TIP_AND_WARNING@

**Tool Description**

The Penalized Mahalanobis distance (PMD) tool can be used to compare samples within a group and accounts for the correlation structure between metabolites.
In contrast, Standardized Euclidian distance (SED) relies solely on geometric distance and ignores any dependency structures between features.
PMD incorporates the correlation structure inside the distance measurement.

When correlation structure and dependency between metabolites is ignored, the features inverse variance-covariance matrix simplifies to a diagonal matrix with diagonal values - in this case, MD simplifies to SED.
When the number of features is greater than the number of samples, the inverse of the features variance-covariance matrix does not exist.
This is the case for most -omic data. Here, the inverse is estimated using a regularization method (Archambeau et al. 2004).
The details of the regularization algorithm can be found in Supplementary file 3 in Kirpich et al. 2017.

Archambeau C, Vrins F, Verleysen M. Flexible and Robust Bayesian Classification by Finite Mixture Models. InESANN 2004 (pp. 75-80).​

**NOTE:** Because of the nature of the tool, groups with less than 3 samples will be discarded from the analysis.


**Input**

    - Two input datasets are required.

@WIDE@

**NOTE:** The sample IDs must match the sample IDs in the Design File
(below). Extra columns will automatically be ignored.

@METADATA@

@UNIQID@

@GROUP_OPTIONAL@

    - **Warning:** All groups must contain 3 or more samples.


@RUNORDER_OPTIONAL@

**Additional groups to separate by [Optional]**

    - Enter additional group(s) name(s) to include. Spelling and capitalization must be exact. If more than one group, separate them with a comma
    - **Warning:** All groups must contain 3 or more samples.


**Percentile cutoff**

- The percentile cutoff for standard distributions. The default is 0.95.

**λ Penalty**

- λ Penalty to use in the distance. The default is λ=0.5.

--------------------------------------------------------------------------------

**Output**

The tool outputs three different files:

(1) a PDF file containing 2D scatter plots and boxplots for the distances 

(2) a TSV file containing distances from the sample to the estimated mean 

(3) a TSV file containing distances from the sample to other samples.

If the grouping variable is specified by the user, the distances are computed both within the groups and for the entire dataset.

    ]]></help>
    <expand macro="citations"/>
</tool>