
diff README.md @ 1:5c5027485f7d draft
Uploaded correct file
author: damion
date: Sun, 09 Aug 2015 16:07:50 -0400
parents: d31a1bd74e63
--- a/README.md	Sun Aug 09 16:05:40 2015 -0400
+++ b/README.md	Sun Aug 09 16:07:50 2015 -0400
@@ -1,47 +1,52 @@
-Feature Frequency Profile Phylogenies
-=====================================
+# Versioned Data System
+The Galaxy and command line Versioned Data System manages the retrieval of current and past versions of selected reference sequence databases from local data stores.
 
+---
 
-Introduction
-------------
-
-FFP (Feature frequency profile) is an alignment free comparison tool for phylogenetic analysis and text comparison. It can be applied to nucleotide sequences, complete genomes, proteomes and even used for text comparison.  This software is a Galaxy (http://galaxyproject.org) tool for calculating FFP on one or more fasta sequence or text datasets.
-
-The original command line ffp-phylogeny code is at http://ffp-phylogeny.sourceforge.net/ .  This tool uses Aaron Petkau's modified version: https://github.com/apetkau/ffp-3.19-custom .  Aaron has quite a good writeup of the technique as well at https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny .
+0. Overview
+1. [Setup for Admins](doc/setup.md)
+  1. [Galaxy tool installation](doc/galaxy_tool_install.md)
+  2. [Server data stores](doc/data_stores.md)
+  3. [Data store examples](doc/data_store_examples.md)
+  4. [Galaxy "Versioned Data" library setup](doc/galaxy_library.md)
+  5. [Workflow configuration](doc/workflows.md)
+  6. [Permissions, security, and maintenance](doc/maintenance.md)
+  7. [Problem solving](doc/problem_solving.md)
+2. [Using the Galaxy Versioned data tool](doc/galaxy_tool.md)
+3. [System Design](doc/design.md)
+4. [Background Research](doc/background.md)
+5. [Server data store and galaxy library organization](doc/data_store_org.md)
+6. [Data Provenance and Reproducibility](doc/data_provenance.md)
+7. [Caching System](doc/caching.md)
 
-**Installation Note** : Your Galaxy server will need the groff package to be installed on it first (to generate ffp-phylogeny man pages).  A cryptic error will occur if it isn't: "troff: fatal error: can't find macro file s".  This is different from the "groff-base" package.
+---
+
+## Overview
+
+The system can be used on a server from the command line or through the Galaxy bioinformatics workflow platform via the "Versioned Data" tool.  Different kinds of content suit different archiving technologies, so the system provides a few storage options:
 
-This Galaxy tool prepares a mini-pipeline consisting of **[ffpry | ffpaa | ffptxt] > [ ffpfilt | ffpcol > ffprwn] > ffpjsd > ffptree**  .  The last step is optional - by deselecting the "Generate Tree Phylogeny" checkbox, the tool will output a distance matrix rather than a Newick (.nhx) formatted tree file.
+* Fasta sequences (accession ids, descriptions, and the sequences themselves) are suited to storage as one-line key-value records in a key-value store.  Here we introduce **Kipper**, a low-tech, file-based database plugin for this kind of data.  It is geared entirely toward producing complete versioned files.  This covers much of the sequence archiving problem for reference databases.  Consult https://github.com/Public-Health-Bioinformatics/kipper for up-to-date information on Kipper.
 
-Each sequence or text file has a profile containing tallies of each feature found.  A feature is a string of valid characters of given length. 
+* A **git** plugin is also provided for archiving software file trees.  Its file differential (diff) compression is particularly effective for documents that have sentence-like lines added and deleted between versions.
+
+* Very large files that are not suited to Kipper or git can be handled by a simple "**folder**" data store that holds each version of the file(s) in a separate compressed archive.
 
-For nucleotide data, by default each character (ATGC) is grouped as either purine(R) or pyrmidine(Y) before being counted.
+* **Biomaj** (our reference database maintenance software) can be configured to download and store separate version files.  A Biomaj plugin allows direct selection of versioned files within its "data bank" folders.
+
+The Galaxy Versioned Data tool, shown below, provides the interface for retrieving versions of a reference database.  The tool lets you select the fasta database to retrieve, and then one or more workflows.  The system generates and caches the versioned data in the data library, links it into your history, runs the workflow(s) to produce the derivative data (a Blast database, say), and caches that back into the data library.  Future requests for that versioned data and its derivatives (keyed by workflow id and input data version ids) return the data from the cache rather than regenerating it, until the cache is deleted.
+
+![galaxy versioned data tool form](https://github.com/Public-Health-Bioinformatics/versioned_data/blob/master/doc/galaxy_tool_form.png)
 
-For amino acid data, by default each character is grouped into one of the following: (ST),(DE),(KQR),(IVLM),(FWY),C,G,A,N,H,P. Each group is represented by the first character in its series.
+## Project goals
 
-One other key concept is that a given feature, e.g. "TAA" is counted in forward AND reverse directions, mirroring the idea that a feature's orientation is not so important to distinguish when it comes to alignment-free comparison.  The counts for "TAA" and "AAT" are merged.
+* **To enable reproducible molecular biology research:** To recreate a search result as of a certain point in time, we need versioning so that search and mapping tools can use reference sequence databases corresponding to a particular past date or version identifier.  This recall can also explain the difference between what was known in the past and what is known now.
+
+* **To reduce hard drive space usage.**  Some databases are too big to keep N copies around, e.g. 5 years of 16S, updated monthly, is roughly 670Mb + 668Mb + 665Mb + ....  (Compressing each file individually is an option, but better still, we could store just the differences between successive versions.)
  
-The labeling of the resulting counted feature items is perhaps the trickiest concept to master.  Due to computational efficiency measures taken by the developers, a feature that we see on paper as "TAC" may be stored and labeled internally as "GTA", its reverse compliment.  One must look for the alternative if one does not find the original. 
-
-Also note that in amino acid sequences the stop codon "*" (or any other character that is not in the Amino acid alphabet) causes that character frame not to be counted.  Also, character frames never span across fasta entries.
+* **Maximize speed of archive recall.**  Understanding that the archived version files can be large, we'd ideally like a versioned file to be retrieved in the time it takes to write a file of that size to disk.  Caching this data and its derivatives (makeblastdb databases for example) is important.
 
-A few tutorials:
- * http://sourceforge.net/projects/ffp-phylogeny/files/Documentation/tutorial.pdf
- * https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/ffp-phylogeny
-
--------
-**Note**
+* **Improve sequence archive management.** Provide an admin interface for managing the regularly scheduled import and logging of reference sequence databases from our own and 3rd party sources such as NCBI and cpndb.ca.
 
-Taxonomy label details: If each file contains one profile, the file's name is used to label the profile.  If each file contains fasta sequences to profile individually, their fasta identifiers will be used to label them.  The "short labels" option will find the shortest label that uniquely identifies each profile.  Either way, there are some quirks: ffpjsd clips labels to 10 characters if they are greater than 50 characters, so all labels are trimmed to 50 characters first.  Also "id" is prefixed to any numeric label since some tree visualizers won't show purely numeric labels.  In the accidental case where a Fasta sequence label is a duplicate of a previous one it will be prefixed by "DupLabel-".
-
-The command line ffpjsd can hang if one provides an l-mer length greater than the length of file content.  One must identify its process id ("ps aux | grep ffpjsd") and kill it ("kill [process id]").
-
-Finally, it is possible for the ffptree program to generate a tree where some of the branch distances are negative. See https://www.biostars.org/p/45597/
+* Integrate database versioning into the Galaxy workflow management software without adding a lot of complexity.
 
--------
-**References**
- 
-The development of the ffp-phylogeny command line software should be attributed to:
-
-Sims GE, Jun S-R, Wu GA, Kim S-H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America 2009;106(8):2677-2682. doi:10.1073/pnas.0813249106.
-
+* A bonus would be to enable the efficient sharing of versioned data between computers/servers.
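
The caching behaviour described in the README's Overview (versioned data and derivatives keyed by workflow id and input data version ids, reused until the cache is deleted) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `cache_key`, `DerivativeCache`, and the in-memory dictionary standing in for the Galaxy data library are all hypothetical, and sorting the version ids assumes that input order does not matter.

```python
import hashlib


def cache_key(workflow_id, input_version_ids):
    """Build a deterministic key from a workflow id and its input data version ids."""
    # Assumption: input order is irrelevant, so sort the version ids
    # to give the same key for the same set of inputs.
    parts = [str(workflow_id)] + sorted(str(v) for v in input_version_ids)
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


class DerivativeCache:
    """In-memory stand-in for the data library cache (hypothetical)."""

    def __init__(self):
        self._store = {}
        self.builds = 0  # how many times a derivative was actually generated

    def get_or_build(self, workflow_id, input_version_ids, build):
        """Return cached derivative data, calling build() only on a cache miss."""
        key = cache_key(workflow_id, input_version_ids)
        if key not in self._store:
            self._store[key] = build()
            self.builds += 1
        return self._store[key]

    def clear(self):
        """Delete the cache; later requests regenerate their data."""
        self._store.clear()
```

A second request with the same workflow id and version ids returns the stored result without calling `build` again; after `clear()`, the next request regenerates it.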