view galaxy_stubs/FingerprintSimilarityClustering.xml @ 2:605370bc1def draft default tip

Uploaded
author luis
date Tue, 12 Jul 2016 12:33:33 -0400

<?xml version='1.0' encoding='UTF-8'?>
<!--This is a configuration file for the integration of a tool into Galaxy (https://galaxyproject.org/). This file was automatically generated using CTD2Galaxy.-->
<!--Proposed Tool Section: [Chemoinformatics]-->
<tool id="FingerprintSimilarityClustering" name="FingerprintSimilarityClustering" version="1.1.0">
  <description>fast clustering of compounds using 2D binary fingerprints</description>
  <macros>
    <token name="@EXECUTABLE@">FingerprintSimilarityClustering</token>
    <import>macros.xml</import>
  </macros>
  <expand macro="stdio"/>
  <expand macro="requirements"/>
  <command>FingerprintSimilarityClustering

#if $param_t:
  -t $param_t
#end if
#if $param_f:
  -f $param_f
#end if
#if $param_fp_col:
  -fp_col $param_fp_col
#end if
#if $param_id_col:
  -id_col $param_id_col
#end if
#if $param_fp_tag:
  -fp_tag     "$param_fp_tag"
#end if
#if $param_id_tag:
  -id_tag     "$param_id_tag"
#end if
#if $param_tc:
  -tc $param_tc
#end if
#if $param_cc:
  -cc $param_cc
#end if
#if $param_l:
  -l $param_l
#end if
#if $param_nt:
  -nt     "$param_nt"
#end if
#if $param_sdf_out:
  -sdf_out $param_sdf_out
#end if
&gt; $param_stdout
</command>
  <inputs>
    <param name="param_t" type="data" format="smi.gz,csv,sdf.gz,sdf,txt.gz,smi,txt,csv.gz" optional="False" label="Target library input file" help="(-t) "/>
    <param name="param_f" type="integer" min="1" max="2" optional="False" value="0" label="Fingerprint format [1 = binary bitstring, 2 = comma separated feature list]" help="(-f) "/>
    <param name="param_fp_col" type="integer" value="-1" label="Column number for comma-separated SMILES input which contains the fingerprint" help="(-fp_col) "/>
    <param name="param_id_col" type="integer" value="-1" label="Column number for comma-separated SMILES input which contains the molecule identifier" help="(-id_col) "/>
    <param name="param_fp_tag" type="text" size="30" value=" " label="Tag name for SDF input which contains the fingerprint" help="(-fp_tag) ">
      <sanitizer>
        <valid initial="string.printable">
          <remove value="'"/>
          <remove value="&quot;"/>
        </valid>
      </sanitizer>
    </param>
    <param name="param_id_tag" type="text" size="30" value=" " label="Tag name for SDF input which contains the molecule identifier" help="(-id_tag) ">
      <sanitizer>
        <valid initial="string.printable">
          <remove value="'"/>
          <remove value="&quot;"/>
        </valid>
      </sanitizer>
    </param>
    <param name="param_tc" type="float" value="0.7" label="Tanimoto cutoff [default: 0.7]" help="(-tc) "/>
    <param name="param_cc" type="integer" value="1000" label="Clustering size cutoff [default: 1000]" help="(-cc) "/>
    <param name="param_l" type="integer" value="0" label="Number of fingerprints to read" help="(-l) "/>
    <param name="param_nt" type="text" size="30" value="1" label="Number of parallel threads to use" help="(-nt) To use all possible threads enter &lt;max&gt; [default: 1]">
      <sanitizer>
        <valid initial="string.printable">
          <remove value="'"/>
          <remove value="&quot;"/>
        </valid>
      </sanitizer>
    </param>
    <param name="param_sdf_out" type="integer" min="0" max="1" optional="True" value="0" label="If input file has SD format, this flag activates writing of clustering information as new tags in a copy of the input SD file" help="(-sdf_out) "/>
  </inputs>
  <expand macro="advanced_options"/>
  <outputs>
    <data name="param_stdout" format="txt" label="Output from stdout"/>
  </outputs>
  <help>This tool performs a fast and deterministic semi-hierarchical clustering of input compounds encoded as 2D binary fingerprints.

The method is a multistep workflow. First, duplicate fingerprints are removed to reduce the input to a unique set. Second, this unique set is decomposed
into connected components by calculating all pairwise Tanimoto similarities and applying a similarity cutoff. Third, all connected components
which exceed a predefined size are hierarchically clustered using the average linkage criterion, and the Kelley method is applied to every hierarchical clustering
to determine a level for cluster selection. Finally, the fingerprint duplicates are remapped onto the final clusters which contain their representatives.
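
The first two steps and the final remapping can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names are hypothetical, fingerprints are assumed to be given as sets of on-bit indices, two empty fingerprints are assumed to have similarity 1.0, and the hierarchical clustering of oversized components (average linkage with the Kelley method) is left out.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a.union(b))
    # Assumption: two empty fingerprints are treated as identical.
    return len(a.intersection(b)) / union if union else 1.0


def connected_components(fingerprints, cutoff=0.7):
    """Group compound indices whose unique fingerprints are linked at or above the cutoff."""
    # Step 1: collapse duplicates so each distinct fingerprint is compared once.
    unique = {}
    for idx, fp in enumerate(fingerprints):
        unique.setdefault(frozenset(fp), []).append(idx)
    keys = list(unique)

    # Step 2: union-find over all pairs whose similarity reaches the cutoff.
    parent = list(range(len(keys)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(keys)), 2):
        if tanimoto(keys[i], keys[j]) >= cutoff:
            parent[find(i)] = find(j)

    # Final step: remap duplicates onto the component of their representative.
    components = {}
    for k, key in enumerate(keys):
        components.setdefault(find(k), []).extend(unique[key])
    return sorted(sorted(members) for members in components.values())
```

For instance, `connected_components([{1, 2, 3}, {1, 2, 3}, {1, 2, 3, 4}, {7, 8}, {7, 9}])` groups the first three compounds together (the duplicates share one representative, and 3/4 = 0.75 meets the default cutoff) and leaves the last two as singletons (1/3 is below 0.7).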

For every final cluster a medoid is calculated. A single cluster can have multiple medoids because fingerprint duplicates of a medoid are also marked as medoids.
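
As a rough illustration of the medoid marking, the Python sketch below takes the medoid to be the member with the highest average Tanimoto similarity to all other cluster members. That definition, the function names, the set-of-on-bits fingerprint representation and the value 1.0 reported for a singleton cluster are assumptions made for this example, not details taken from the tool.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a.union(b))
    return len(a.intersection(b)) / union if union else 1.0


def medoid_flags(cluster):
    """Return (flags, avg_sims) for a cluster given as a list of fingerprints.

    flags[i] is 1 if member i is a medoid, otherwise 0.
    avg_sims[i] is the average similarity of member i to all other members.
    """
    n = len(cluster)
    if n == 1:
        # Assumption: a singleton is its own medoid with average similarity 1.0.
        return [1], [1.0]

    avg_sims = []
    for i, fp in enumerate(cluster):
        total = sum(tanimoto(fp, other) for j, other in enumerate(cluster) if j != i)
        avg_sims.append(total / (n - 1))

    # The first member with the highest average similarity is taken as the medoid.
    best = max(range(n), key=lambda i: avg_sims[i])

    # Duplicates of the medoid's fingerprint are marked as medoids as well.
    flags = [1 if set(fp) == set(cluster[best]) else 0 for fp in cluster]
    return flags, avg_sims
```

With the cluster `[{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3}, {1, 2}]`, the first and third members are duplicates of each other and both receive the medoid flag.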

For every compound, the output yields a cluster ID, a medoid tag (where '1' indicates a cluster medoid), and the average similarity of the compound to all other
cluster members. If the output format is SD, these properties are added as new tags.

======================================================================================================================================================

Examples:

$ FingerprintSimilarityClustering -t target.sdf -fp_tag FPRINT -f 1 -id_tag NAME
  Tries to read fingerprints as binary bitstrings (-f 1) from tag &lt;FPRINT&gt; and compound IDs from tag &lt;NAME&gt; of the target.sdf input file.
  The clustering workflow described above is executed on the input molecules with default values.

$ FingerprintSimilarityClustering -t target.csv -fp_col 3 -f 2 -id_col 1
  Tries to read fingerprints as comma-separated integer feature lists (-f 2) from column 3 and IDs from column 1 of a space-separated CSV file.
  The clustering workflow described above is executed on the input molecules with default values.

$ FingerprintSimilarityClustering -t target.sdf -fp_tag FPRINT -f 1 -id_tag NAME -nt max
  Same as first example but executed in parallel mode using as many threads as available.

$ FingerprintSimilarityClustering -t target.sdf -fp_tag FPRINT -f 1 -id_tag NAME -tc 0.5 -cc 50
  Same as first example but using modified parameters for similarity network generation (-tc 0.5) and size of connected components to be clustered (-cc 50).

</help>
</tool>