view kmersvm/README.txt @ 0:7fe1103032f7 draft
Uploaded
author | cafletezbrant |
---|---|
date | Mon, 20 Aug 2012 18:07:22 -0400 |
parents | |
children | fd740d515502 |
DEPENDENCIES:
*************

KmerSVM requires the following software (to be installed in this order):

Mac Users:
1. Xcode (Mac App Store)
2. Fortran compiler (http://gcc.gnu.org/wiki/GFortran/)

Everyone:
1. Swig (http://www.swig.org; needed specifically to install python_modular package from Shogun Toolbox)
2. Numpy (numpy.scipy.org)
3. Shogun Toolbox, v0.9.3 - v1.10 (http://www.shogun-toolbox.org/)
4. Bitarray (http://pypi.python.org/pypi/bitarray/)
5. R (http://www.r-project.org)
6. ROCR R Package (Available through CRAN)

KmerSVM has been tested on Python 2.6 and 2.7 on Linux and Mac OS X. At this time KmerSVM has not been tested on Windows.

Note that binaries are provided for Mac users. However, if difficulties in installation are encountered, it may be beneficial to compile the Fortran compiler from source. Additionally, be sure to add the location of your Shogun installation to the PYTHONPATH.

REQUIRED FILES:
***************

Use the install.sh script to install many required files. Specifically:

    sh run.sh /path/to/galaxy-dist/tools

For efficient access to genome-wide data, "Generate Null Sequence" and "Sequence Profiles" rely on binary files (indices) generated by the script nullseq_build_indices.py. Download the *.tar or *.zip files for each genome to be analyzed. To create indices for a specific genome, call nullseq_build_indices.py. For example:

    python nullseq_build_indices.py mm8.zip mm8

Alternatively, we offer a handful of prepared index files, which can be downloaded from our website (www.beerlab.org/kmersvm.html) and then extracted.

Next, open the file tool-data/nullseq_indices.loc and add the path to the created indices, following the instructions included in that file.
For the genomes listed above, you would add the following lines to nullseq_indices.loc:

    mm8 Mouse(mm8) /path/to/nullseq_indices_mm8
    mm9 Mouse(mm9) /path/to/nullseq_indices_mm9
    hg18 Human(hg18) /path/to/nullseq_indices_hg18
    hg19 Human(hg19) /path/to/nullseq_indices_hg19

To generate FASTA files for training or scoring purposes, kmer-SVM uses the built-in Galaxy tool "Fetch Sequences", which looks for genomes in *.nib or *.2bit format. Download the genomes related to your data and update the tool-data/alignseq.loc file to include the location of these genomes, according to the directions in that file. FASTA files can also be provided by the user.

"Fetch Sequences" should be set up as follows:

Download 2bit files from the UCSC genome browser. For example:

    http://hgdownload.cse.ucsc.edu/goldenPath/mm8/bigZips/mm8.2bit
    http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.2bit
    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit
    http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

Add the following lines to galaxy-dist/tool-data/alignseq.loc:

    seq mm8 /path/to/mm8.2bit
    seq mm9 /path/to/mm9.2bit
    seq hg18 /path/to/hg18.2bit
    seq hg19 /path/to/hg19.2bit

TOOL_CONF.XML:
**************

Add the following lines to tool_conf.xml:

    <section name="SVM Tools" id="kmersvm">
      <tool file="kmersvm/classify.xml"/>
      <tool file="kmersvm/nullseq.xml"/>
      <tool file="kmersvm/rocprcurve.xml"/>
      <tool file="kmersvm/train.xml"/>
      <tool file="kmersvm/split_genome.xml"/>
      <tool file="kmersvm/seqprofile.xml" />
    </section>

Tool Tests:
***********

Galaxy tools come with functional tests to determine whether the tools are operating correctly. To run tests on Galaxy tools, use the script run_functional_tests.sh. We offer tests for the tools "Train SVM", "Score Sequences of Interest" and "Split Genome". IDs for kmer-SVM tests can be found by calling run_functional_tests.sh with the '-list' flag.
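
Before running the functional tests, it can help to confirm that the kmersvm section actually made it into tool_conf.xml. The following is a minimal sketch only, not part of the KmerSVM distribution. It assumes tool_conf.xml is well-formed XML with a single root element (typically <toolbox> in Galaxy); the function name missing_kmersvm_tools is hypothetical, and the expected tool files are simply the six listed in the TOOL_CONF.XML section.

```python
import xml.etree.ElementTree as ET

# The six tool files listed in the TOOL_CONF.XML section of this README.
EXPECTED_TOOLS = [
    "kmersvm/classify.xml",
    "kmersvm/nullseq.xml",
    "kmersvm/rocprcurve.xml",
    "kmersvm/train.xml",
    "kmersvm/split_genome.xml",
    "kmersvm/seqprofile.xml",
]


def missing_kmersvm_tools(tool_conf_text):
    """Return the expected tool files that are NOT registered in the
    <section id="kmersvm"> element of the given tool_conf.xml text.

    Hypothetical helper; assumes the text parses as XML.
    """
    root = ET.fromstring(tool_conf_text)
    found = set()
    for section in root.iter("section"):
        if section.get("id") == "kmersvm":
            for tool in section.iter("tool"):
                found.add(tool.get("file"))
    return [name for name in EXPECTED_TOOLS if name not in found]


if __name__ == "__main__":
    # Example input with one tool deliberately left out.
    sample = """
    <toolbox>
      <section name="SVM Tools" id="kmersvm">
        <tool file="kmersvm/classify.xml"/>
        <tool file="kmersvm/nullseq.xml"/>
        <tool file="kmersvm/rocprcurve.xml"/>
        <tool file="kmersvm/train.xml"/>
        <tool file="kmersvm/split_genome.xml"/>
      </section>
    </toolbox>
    """
    print(missing_kmersvm_tools(sample))
```

To check a real installation, read galaxy-dist/tool_conf.xml into a string and pass it to the function; an empty list means all six tools are registered.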
Non-Galaxy-Based Usage:
***********************

The KmerSVM suite can be run without using the Galaxy framework. Each tool exists as a standalone Python script (all located in /scripts) which can be called from the command line. Specific documentation can be found within each tool's Python file, or by calling the script with no arguments.

A general workflow can be found in 'kmer-SVM: a Web-based Toolkit for the Computational Identification of Predictive Regulatory Sequence Features in Genomic Datasets'. It can be followed by calling each of the relevant Python scripts, with the exception that users will have to provide the needed FASTA files themselves.

A simple workflow for the KmerSVM suite is as follows:

1. python nullseq_build_indices.py mm8.zip mm8
2. python nullseq_generate.py sample_input.bed mm8 /path/to/mm8/indices
   # This assumes no negative data sets. Output will need to be converted to FASTA. Skip if negative data is provided.
3. python kmersvm_train.py positive.fa negative.fa
   # Outputs will be WEIGHTS, PREDICTIONS
4. python split_genome.py input.bed
   # Skip if you already have a list of regions you want to test. Output is test_seq.bed, which will need to be converted to FASTA.
5. python kmersvm_classify.py weights.out test_seq.fa

Additionally, for any BED file, sequence composition (in terms of length, GC content and repeat fraction) can be obtained by calling 'make profile' as follows:

    python make_profile.py input.bed mm8 /path/to/mm8/indices profile.out

Note that each tool has its own parameters, the manipulation of which allows the user to further customize their analysis. To learn more about a particular tool, simply call it without passing it any arguments.
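
The five-step workflow above can also be assembled programmatically. The following is a minimal sketch under stated assumptions, not a script shipped with KmerSVM: the function name build_workflow_commands and its parameters are hypothetical, the '.py' suffix on nullseq_generate is assumed, and the BED-to-FASTA conversions required after steps 2 and 4 are left to the user. It only builds the command lines and prints them; nothing is executed.

```python
def build_workflow_commands(genome, genome_archive, index_dir,
                            positive_bed, positive_fa, negative_fa,
                            regions_bed, test_fa, weights="weights.out"):
    """Return the five README workflow steps as argv lists.

    Hypothetical helper: the commands are only constructed, never run.
    Converting BED output to FASTA (after steps 2 and 4) is not handled.
    """
    return [
        # 1. Build the null-sequence indices for the genome.
        ["python", "nullseq_build_indices.py", genome_archive, genome],
        # 2. Generate null (negative) sequences; skip if negatives exist.
        #    The ".py" suffix here is an assumption.
        ["python", "nullseq_generate.py", positive_bed, genome, index_dir],
        # 3. Train the SVM on positive and negative FASTA files.
        ["python", "kmersvm_train.py", positive_fa, negative_fa],
        # 4. Split the genome into regions to test; skip if regions exist.
        ["python", "split_genome.py", regions_bed],
        # 5. Score the test sequences with the trained weights.
        ["python", "kmersvm_classify.py", weights, test_fa],
    ]


if __name__ == "__main__":
    commands = build_workflow_commands(
        genome="mm8",
        genome_archive="mm8.zip",
        index_dir="/path/to/mm8/indices",
        positive_bed="sample_input.bed",
        positive_fa="positive.fa",
        negative_fa="negative.fa",
        regions_bed="input.bed",
        test_fa="test_seq.fa",
    )
    for argv in commands:
        print(" ".join(argv))
```

Each argv list could be passed to subprocess.call from the /scripts directory once the intermediate FASTA files are in place. Keeping the commands as lists rather than shell strings avoids quoting problems with paths that contain spaces.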