annotate README.md @ 2:269d246ce6d0 draft default tip

Uploaded
author damion
date Fri, 23 Oct 2015 17:53:29 -0400
parents 5c5027485f7d
children
# Versioned Data System
The Galaxy and command line Versioned Data System manages the retrieval of current and past versions of selected reference sequence databases from local data stores.

---

0. Overview
1. [Setup for Admins](doc/setup.md)
    1. [Galaxy tool installation](doc/galaxy_tool_install.md)
    2. [Server data stores](doc/data_stores.md)
    3. [Data store examples](doc/data_store_examples.md)
    4. [Galaxy "Versioned Data" library setup](doc/galaxy_library.md)
    5. [Workflow configuration](doc/workflows.md)
    6. [Permissions, security, and maintenance](doc/maintenance.md)
    7. [Problem solving](doc/problem_solving.md)
2. [Using the Galaxy Versioned data tool](doc/galaxy_tool.md)
3. [System Design](doc/design.md)
4. [Background Research](doc/background.md)
5. [Server data store and galaxy library organization](doc/data_store_org.md)
6. [Data Provenance and Reproducibility](doc/data_provenance.md)
7. [Caching System](doc/caching.md)

---

## Overview

This tool can be used on a server both via the command line and via the Galaxy bioinformatics workflow platform using the "Versioned Data" tool. Different kinds of content are suited to different archiving technologies, so the system provides a few storage system choices.

* FASTA sequences - accession ids, descriptions and their sequences - are suited to storage as one-line key-value pair records in a key-value store. Here we introduce a low-tech file-based database plugin for this kind of data called **Kipper**. It is designed entirely around the goal of producing complete versioned files. This covers much of the sequence archiving problem for reference databases. Consult https://github.com/Public-Health-Bioinformatics/kipper for up-to-date information on Kipper.

* A **git** archiving system plugin is also provided for software file tree archiving, with a particular file differential (diff) compression benefit for documents that have sentence-like lines added and deleted between versions.

* Super-large files that are not suited to Kipper or git can be handled by a simple "**folder**" data store that holds each version of the file(s) in a separate compressed archive.

* **Biomaj** (our reference database maintenance software) can be configured to download and store separate version files. A Biomaj plugin allows direct selection of versioned files within its "data bank" folders.
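
To make the one-line key-value idea from the Kipper bullet above more concrete, here is a minimal Python sketch that flattens FASTA text into one record per accession id. The function name `fasta_to_records` and the tab-separated layout are assumptions made for illustration; they are not Kipper's actual API or file format.

```python
def fasta_to_records(fasta_text):
    """Flatten FASTA text into {accession_id: "description<TAB>sequence"}.

    Hypothetical helper for illustration only; not part of Kipper.
    """
    records = {}
    key, description, chunks = None, "", []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if key is not None:
                records[key] = description + "\t" + "".join(chunks)
            header = line[1:].split(None, 1)
            key = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif key is not None:
            # Sequence lines are joined so each record fits on one line.
            chunks.append(line)
    if key is not None:
        records[key] = description + "\t" + "".join(chunks)
    return records


example = """>seq1 first example
ACGT
TTGA
>seq2
GGCC
"""

for accession, value in sorted(fasta_to_records(example).items()):
    print(accession + "\t" + value)
```

Keeping one record per line means a sorted file can be compared against the previous version key by key, which is the property a versioned key-value store relies on.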

The Galaxy Versioned Data tool, shown below, provides the interface for retrieving versions of a reference database. The tool lets you select the fasta database to retrieve, and then one or more workflows. The system then generates and caches the versioned data in the data library, links it into your history, runs the workflow(s) to produce the derivative data (a Blast database, say), and caches that back into the data library. Future requests for that versioned data and its derivatives (keyed by workflow id and input data version ids) return the data from the cache rather than regenerating it, until the cache is deleted.

![galaxy versioned data tool form](https://github.com/Public-Health-Bioinformatics/versioned_data/blob/master/doc/galaxy_tool_form.png)
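
The caching behaviour described above (derivatives keyed by workflow id and input data version ids) can be sketched roughly as follows. This is an in-memory illustration under stated assumptions: `cache_key`, `DerivativeCache` and the SHA-1 key scheme are hypothetical names and choices, not the tool's real implementation, and the order of the input ids is assumed not to matter.

```python
import hashlib


def cache_key(workflow_id, input_version_ids):
    """Build a deterministic key from a workflow id and its input version ids.

    Hypothetical scheme for illustration; the real tool's naming may differ.
    """
    parts = [str(workflow_id)] + sorted(str(v) for v in input_version_ids)
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


class DerivativeCache:
    """In-memory stand-in for the data library cache."""

    def __init__(self):
        self._store = {}

    def get_or_build(self, workflow_id, input_version_ids, build):
        key = cache_key(workflow_id, input_version_ids)
        if key not in self._store:
            # Cache miss: run the (expensive) workflow once and keep the result.
            self._store[key] = build()
        return self._store[key]

    def clear(self):
        # Deleting the cache forces regeneration on the next request.
        self._store.clear()
```

A caller would pass a function that runs the workflow as `build`; repeated requests with the same workflow id and version ids then return the stored result without calling it again.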

## Project goals

* **To enable reproducible molecular biology research:** To recreate a search result at a certain point in time we need versioning so that search and mapping tools can look at reference sequence databases corresponding to a particular past date or version identifier. This recall can also explain the difference between what was known in the past vs. currently.

* **To reduce hard drive space.** Some databases are too big to keep N copies around, e.g. 5 years of 16S, updated monthly, is, say, 670Mb + 668Mb + 665Mb + .... (Compressing each file individually is an option, but even better, we could store just the differences between subsequent versions.)

* **Maximize speed of archive recall.** Understanding that the archived version files can be large, we'd ideally like a versioned file to be retrieved in the time it takes to write a file of that size to disk. Caching this data and its derivatives (makeblastdb databases for example) is important.

* **Improve sequence archive management.** Provide an admin interface for managing the regularly scheduled import and logging of reference sequence databases from our own and 3rd party sources like NCBI and cpndb.ca.

* Integrate database versioning into the Galaxy workflow management software without adding a lot of complexity.

* A bonus would be to enable the efficient sharing of versioned data between computers/servers.
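
The space-saving goal above mentions storing just the differences between subsequent versions. As a rough sketch of that idea for key-value records, the following Python computes a delta between two versions and rebuilds the newer one from the older one. The function names and the delta layout are assumptions for illustration; this is not the storage format used by Kipper or git.

```python
def make_delta(old, new):
    """Return the changes needed to turn record dict `old` into `new`."""
    return {
        # Keys present in the old version but gone from the new one.
        "deleted": sorted(k for k in old if k not in new),
        # Keys that are new, or whose value changed.
        "upserted": {
            k: v for k, v in new.items() if k not in old or old[k] != v
        },
    }


def apply_delta(old, delta):
    """Rebuild the newer version from the older one plus a delta."""
    rebuilt = {k: v for k, v in old.items() if k not in delta["deleted"]}
    rebuilt.update(delta["upserted"])
    return rebuilt


v1 = {"seq1": "ACGT", "seq2": "GGCC"}
v2 = {"seq1": "ACGA", "seq3": "TTTT"}
delta = make_delta(v1, v2)
print(apply_delta(v1, delta) == v2)
```

When most records are unchanged between monthly releases, the delta is much smaller than a full copy, at the cost of replaying deltas when an old version is recalled.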