Mercurial > repos > nilesh > rseqc

<tool id="rseqc_RPKM_saturation" name="RPKM Saturation" version="@TOOL_VERSION@.2">
    <description>calculates raw count and RPKM values for transcript at exon, intron, and mRNA level</description>
    <expand macro="bio_tools"/>
    <macros>
        <import>rseqc_macros.xml</import>
    </macros>

    <expand macro="requirements" />

    <expand macro="stdio" />

    <version_command><![CDATA[RPKM_saturation.py --version]]></version_command>

    <command><![CDATA[
        RPKM_saturation.py -i '${input}' -o output -r '${refgene}'

        #if str($strand_type.strand_specific) == "pair"
            -d
            #if str($strand_type.pair_type) == "sd"
                '1++,1--,2+-,2-+'
            #else
                '1+-,1-+,2++,2--'
            #end if
        #end if

        #if str($strand_type.strand_specific) == "single"
            -d
            #if str($strand_type.single_type) == "s"
                '++,--'
            #else
                '+-,-+'
            #end if
        #end if

        -l ${percentileFloor} -u ${percentileCeiling} -s ${percentileStep} -c ${rpkmCutoff}
        --mapq $mapq
    ]]></command>

    <inputs>
        <expand macro="bam_param" />
        <expand macro="refgene_param" />
        <expand macro="strand_type_param" />
        <param name="percentileFloor" type="integer" value="5" label="Begin sampling from this percentile (default=5)" help="(--percentile-floor)"/>
        <param name="percentileCeiling" type="integer" value="100" label="End sampling at this percentile (default=100)" help="(--percentile-ceiling)" />
        <param name="percentileStep" type="integer" value="5" label="Sampling step size (default=5)" help="(--percentile-step)" />
        <param name="rpkmCutoff" type="text" value="0.01" label="Ignore transcripts with RPKM smaller than this number (default=0.01)" help="(--rpkm-cutoff)" />
        <expand macro="mapq_param" />
        <expand macro="rscript_output_param" />
    </inputs>

    <outputs>
        <expand macro="pdf_output_data" filename="output.saturation.pdf" />
        <data format="xls" name="outputxls" from_work_dir="output.eRPKM.xls" label="${tool.name} on ${on_string} (RPKM xls)"/>
        <data format="xls" name="outputrawxls" from_work_dir="output.rawCount.xls" label="${tool.name} on ${on_string} (Raw Count xls)"/>
        <expand macro="rscript_output_data" filename="output.saturation.r" />
    </outputs>

    <tests>
        <test>
            <param name="input" value="pairend_strandspecific_51mer_hg19_random.bam"/>
            <param name="refgene" value="hg19.HouseKeepingGenes_30.bed"/>
            <param name="rscript_output" value="true" />
            <output name="outputxls">
              <assert_contents>
                <has_n_columns n="26" />
                <has_line_matching expression="chr1\t16174358\t16266950\tNM_015001.*" />
              </assert_contents>
            </output>
            <output name="outputrawxls">
              <assert_contents>
                <has_n_columns n="26" />
                <has_line_matching expression="chr1\t16174358\t16266950\tNM_015001.*" />
              </assert_contents>
            </output>
            <output name="outputr">
              <assert_contents>
                <has_text text="pdf('output.saturation.pdf')" />
                <has_line_matching expression="S5=c\(\d+\.\d+\)" />
              </assert_contents>
            </output>
            <output name="outputpdf" file="output.saturation.pdf" compare="sim_size" />
        </test>
    </tests>

    <help><![CDATA[
RPKM_saturation.py
++++++++++++++++++

The precision of any sample statitics (RPKM) is affected by sample size (sequencing depth);
\'resampling\' or \'jackknifing\' is a method to estimate the precision of sample statistics by
using subsets of available data. This module will resample a series of subsets from total RNA
reads and then calculate RPKM value using each subset. By doing this we are able to check if
the current sequencing depth was saturated or not (or if the RPKM values were stable or not)
in terms of genes' expression estimation. If sequencing depth was saturated, the estimated
RPKM value will be stationary or reproducible. By default, this module will calculate 20
RPKM values (using 5%, 10%, ... , 95%,100% of total reads) for each transcripts.

In the output figure, Y axis is "Percent Relative Error" or "Percent Error" which is used
to measures how the RPKM estimated from subset of reads (i.e. RPKMobs) deviates from real
expression level (i.e. RPKMreal). However, in practice one cannot know the RPKMreal. As a
proxy, we use the RPKM estimated from total reads to approximate RPKMreal.

.. image:: $PATH_TO_IMAGES/RelativeError.png
   :height: 80 px
   :width: 400 px
   :scale: 100 %

Inputs
++++++++++++++

Input BAM/SAM file
    Alignment file in BAM/SAM format.

Reference gene model
    Gene model in BED format.

Strand sequencing type (default=none)
    See Infer Experiment tool if uncertain.

Options
++++++++++++++

Skip Multiple Hit Reads
    Use Multiple hit reads or use only uniquely mapped reads.

Only use exonic reads
    Renders program only used exonic (UTR exons and CDS exons) reads, otherwise use all reads.

Output
++++++++++++++

1. output..eRPKM.xls: RPKM values for each transcript
2. output.rawCount.xls: Raw count for each transcript
3. output.saturation.r: R script to generate plot
4. output.saturation.pdf:

.. image:: $PATH_TO_IMAGES/saturation.png
   :height: 600 px
   :width: 600 px
   :scale: 80 %

- All transcripts were sorted in ascending order according to expression level (RPKM). Then they are divided into 4 groups:
    1. Q1 (0-25%): Transcripts with expression level ranked below 25 percentile.
    2. Q2 (25-50%): Transcripts with expression level ranked between 25 percentile and 50 percentile.
    3. Q3 (50-75%): Transcripts with expression level ranked between 50 percentile and 75 percentile.
    4. Q4 (75-100%): Transcripts with expression level ranked above 75 percentile.
- BAM/SAM file containing more than 100 million alignments will make module very slow.
- Follow example below to visualize a particular transcript (using R console)::

    pdf("xxx.pdf")     #starts the graphics device driver for producing PDF graphics
    x &lt;- seq(5,100,5)  #resampling percentage (5,10,15,...,100)
    rpkm &lt;- c(32.95,35.43,35.15,36.04,36.41,37.76,38.96,38.62,37.81,38.14,37.97,38.58,38.59,38.54,38.67, 38.67,38.87,38.68,  38.42,  38.23)  #Paste RPKM values calculated from each subsets
    scatter.smooth(x,100*abs(rpkm-rpkm[length(rpkm)])/(rpkm[length(rpkm)]),type="p",ylab="Precent Relative Error",xlab="Resampling Percentage")
    dev.off()          #close graphical device

.. image:: $PATH_TO_IMAGES/saturation_eg.png
   :height: 600 px
   :width: 600 px
   :scale: 80 %

@ABOUT@

]]>
    </help>

    <expand macro="citations" />

</tool>
author	iuc
date	Sat, 18 Dec 2021 19:41:19 +0000
parents	5873cd7afb67
children	1421603cc95b