annotate README.rst @ 1:f8dee15a72a4 draft

Uploaded
author pedro_araujo
date Wed, 27 Jan 2021 14:52:31 +0000
parents e4b3fc88efe0
children 8674f554d76b

PhageHostPrediction
===================

Predict interactions between phages and bacterial strains.

PhageHostPrediction is a Python script that predicts phage-host interactions for *E. coli*, *K. pneumoniae* and *A. baumannii* phages using supervised machine learning models. The models were built from a dataset of 23 987 entries and 252 features, with balanced 'Yes' and 'No' outputs. The positive interactions are listed in the file "NCBI_Phage_Bacteria_Data.csv", included with this tool, while the negative ones were generated by randomly pairing phages with bacteria of different species.
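
The construction of negative cases described above can be sketched as follows. This is only an illustration, not the code used to build the dataset: the function name, the identifiers and the data layout are hypothetical, and "different species" is assumed to mean a species other than that of the phage's known hosts.

```python
import random


def sample_negative_pairs(positive_pairs, host_species, n_negatives, seed=0):
    """Randomly pair phages with bacteria of a species they are not known to infect.

    positive_pairs: list of (phage_id, bacterium_id) known interactions
    host_species: dict mapping bacterium_id -> species name
    (both structures are hypothetical, chosen for this sketch)
    """
    rng = random.Random(seed)

    # Species that each phage is known to infect
    phage_species = {}
    for phage, bacterium in positive_pairs:
        phage_species.setdefault(phage, set()).add(host_species[bacterium])

    # Every phage-bacterium pair whose bacterium belongs to another species
    candidates = [
        (phage, bacterium)
        for phage in sorted(phage_species)
        for bacterium in sorted(host_species)
        if host_species[bacterium] not in phage_species[phage]
    ]

    # Draw without replacement, capped at the number of available candidates
    return rng.sample(candidates, min(n_negatives, len(candidates)))


# Made-up identifiers, for illustration only
positives = [("phage_A", "ecoli_1"), ("phage_B", "kpneu_1")]
species = {
    "ecoli_1": "Escherichia coli",
    "ecoli_2": "Escherichia coli",
    "kpneu_1": "Klebsiella pneumoniae",
}
# Request as many negatives as positives to keep the outputs balanced
negatives = sample_negative_pairs(positives, species, n_negatives=len(positives))
```

Drawing as many negative pairs as there are positive ones is what keeps the 'Yes' and 'No' outputs balanced.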

The prediction relies on the complete host proteome and on phage tail proteins, which are inferred within the tool. This inference uses a locally built database of phage protein functions, available in the file "phagesProteins.json". Proteins of unknown function are compared against this database. InterProScan can optionally be used to support this prediction.

**Inputs:**

* phage/bacteria genome format: ID or fasta file;

  * ID: must be a GenBank ID, with the proteome described;
  * fasta file: must contain the whole proteome of the organism;

* machine learning model: random forests have better predictive power, while SVM can be slightly faster to run;
* InterPro search: should predict tails with higher confidence, but it significantly increases the time to run.

**Outputs:**

This tool outputs a tabular file with the phage-host pair in the first column and the prediction result in the second.
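
A minimal sketch of how such a file could be read back in Python is shown below. The tab delimiter, the absence of a header row, the pair formatting and the 'Yes'/'No' labels are all assumptions made for illustration, and ``read_predictions`` is a hypothetical helper, not part of the tool.

```python
import csv
import io


def read_predictions(handle):
    """Return {pair: prediction} from a two-column, tab-separated file."""
    predictions = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pair, result = row[0], row[1]
        predictions[pair] = result
    return predictions


# Made-up content; real identifiers and labels may differ
example = "phage_A - ecoli_1\tYes\nphage_A - kpneu_1\tNo\n"
predictions = read_predictions(io.StringIO(example))

# Keep only the pairs predicted to interact
positive_pairs = [pair for pair, result in predictions.items() if result == "Yes"]
```

To read a real output file, pass an open file handle instead of the ``io.StringIO`` object.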

**Requirements:**

* Biopython
* Scikit-learn
* NumPy
* Pandas
* Scikit-bio
* BLAST_ - must be installed locally and globally accessible through the ``PATH`` environment variable
* InterProScan_ (optional) - must be installed locally and globally accessible through the ``PATH`` environment variable
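
Whether the external programs are reachable can be checked before running the tool, for instance with the sketch below. The executable names are assumptions: ``blastp`` ships with BLAST+ and ``interproscan.sh`` is the usual InterProScan launcher, but the programs this tool actually calls may differ. ``find_missing_tools`` is a hypothetical helper.

```python
import shutil


def find_missing_tools(required, optional=()):
    """Return (missing_required, missing_optional) executables not found on PATH."""
    missing_required = [name for name in required if shutil.which(name) is None]
    missing_optional = [name for name in optional if shutil.which(name) is None]
    return missing_required, missing_optional


# Assumed executable names; adjust them to the programs installed locally
missing_required, missing_optional = find_missing_tools(
    ["blastp"], ["interproscan.sh"]
)
if missing_required:
    print("Missing required tools:", ", ".join(missing_required))
if missing_optional:
    print("Optional tools not found:", ", ".join(missing_optional))
```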

.. _BLAST: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
.. _InterProScan: http://www.ebi.ac.uk/interpro/download/