view pmids_to_pubtator_matrix.xml @ 2:ada4c7b3fc39 draft default tip

"planemo upload for repository https://github.com/dlal-group/simtext commit 63f67ee02be2eb4323a5ba5dcdd33d1fd0b7c24e"
author dlalgroup
date Thu, 08 Oct 2020 05:39:58 +0000
parents 3f4adc85ba5d

  <tool id="pmids_to_pubtator_matrix" name="pmids_to_pubtator_matrix" version="@VERSION@">
    <description>Per row, extract scientific terms from PMIDs using PubTator and generate a binary matrix</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements">
      <requirement type="package" version="2.0.1">r-argparse</requirement>
      <requirement type="package" version="1.4.0">r-stringr</requirement>
      <requirement type="package" version="1.95_4.12">r-rcurl</requirement>
      <requirement type="package" version="1.4.3">r-stringi</requirement>
    </expand>
    <expand macro="stdio"/>
    <command><![CDATA[Rscript 
      '${__tool_directory__}/pmids_to_pubtator_matrix.R'
      --input '$input'
      --output '$output'
      --number '$number'
      $byid
      --categories 
      #for $category in $categories:
      '$category'
      #end for
      ]]>
    </command>
    <inputs>
      <param argument="--input" label="Input file" name="input" optional="false" type="data" format="tabular" help="Tab-delimited table with PMID columns whose names start with 'PMID'."/>
      <param argument="--categories" name="categories" type="select" label="Categories" multiple="true" display="checkboxes">
        <option value="Gene">Genes</option>
        <option value="Disease">Diseases</option>
        <option value="Mutation">Mutations</option>
        <option value="Chemical">Chemicals</option>
        <option value="Species">Species</option>
      </param>
      <param argument="--byid" label="Group terms by their gene IDs / MeSH IDs instead of using the specific scientific terms." name="byid" type="boolean" truevalue="--byid" falsevalue="" help="Several terms are often grouped into one ID." checked="false"/>
      <param argument="--number" label="Number of most frequent terms/IDs to extract." name="number" optional="true" type="integer" help="Applied per row." value="50"/>
    </inputs>
    <outputs>
      <data format="tabular" name="output" />
    </outputs>
    <tests>
        <test>
          <param name="input" value="pubmed_by_queries_output" ftype="tabular"/>
          <output name="output" value="pmids_to_pubtator_matrix_output"/>
        </test>
    </tests>
    <help><![CDATA[

For each row, the tool takes all PMIDs and extracts "Gene", "Disease", "Mutation", "Chemical" and "Species" terms from the corresponding abstracts, using PubTator annotations. The user can choose which categories terms are extracted from. The extracted terms are combined into one large binary matrix, where 0 = term not present in the abstracts of that row and 1 = term present in the abstracts of that row. The user can decide whether the scientific terms are used as they are or grouped by their gene IDs / MeSH IDs (several terms are often grouped into one ID). The user can also specify the number of most frequent terms to extract per row.

Input: Output of the 'abstracts_by_pmids' tool, or a tab-delimited table with columns containing PMIDs. The names of the PMID columns should start with "PMID", e.g. "PMID_1", "PMID_2", etc.
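
A minimal sketch of such a table, shown with spaces instead of tabs for readability. The row labels and PMIDs are placeholders for illustration, not real query results, and the extra "Query" column is an assumption about a typical table rather than a requirement::

    Query      PMID_1      PMID_2
    query_A    10000001    10000002
    query_B    10000003    10000004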

Output: Binary matrix in which each column represents one of the extracted terms.
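
As an illustration only, assuming two input rows and terms from the "Gene" and "Disease" categories, the matrix could look like the sketch below. The terms and values are made up for the example, and the exact header and row-label layout of the real output may differ::

    BRCA1    TP53    breast cancer
    1        0       1
    0        1       1

Here the first row's abstracts mention "BRCA1" and "breast cancer" but not "TP53".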

        ]]></help>
    <expand macro="citations"/>
</tool>